
Intro to Neural Networks

Matthew S. Shotwell, Ph.D.

Department of Biostatistics, Vanderbilt University School of Medicine

Nashville, TN, USA

April 8, 2020

Neural Networks

A neural network (NN) is a nonlinear model, often represented by a network diagram:


Neural Networks

- the NN in the previous diagram is for a K-class problem
- the output layer has K nodes, one for each class
- a NN for regression would have just one output node

Neural Networks

Model formula for the NN in the previous figure (K-class problem):

- input layer: P features, x_1, ..., x_P
- hidden layer: for each hidden node m = 1, ..., M,

  z_m = σ(α_{0m} + α_{1m} x_1 + ... + α_{Pm} x_P)

- output layer: for each output node k = 1, ..., K,

  t_k = s(β_{0k} + β_{1k} z_1 + ... + β_{Mk} z_M)

- σ is an "activation" function
- s is a "link" function, e.g., the logit or softmax function

- layers are "fully connected" to the layer below
- all nodes in the lower layer contribute to each node of the layer above
- α_{0m} and β_{0k} are called "bias" parameters
- number of parameters: hidden layer M × (P + 1); output layer K × (M + 1) (see the forward-pass sketch below)
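
A minimal sketch of this forward pass in base R, assuming a sigmoid activation σ and a softmax link s; the sizes P, M, K and the random weights are illustrative stand-ins, not values from the slides.

## one observation x pushed through a single-hidden-layer NN (sketch)
set.seed(1)
P <- 4; M <- 3; K <- 2
x     <- rnorm(P)                          # features x_1, ..., x_P
alpha <- matrix(rnorm(M * (P + 1)), M)     # hidden weights, column 1 = bias
beta  <- matrix(rnorm(K * (M + 1)), K)     # output weights, column 1 = bias

sigmoid <- function(u) 1 / (1 + exp(-u))
z <- sigmoid(alpha %*% c(1, x))            # hidden nodes z_1, ..., z_M
t <- beta %*% c(1, z)                      # output nodes t_1, ..., t_K
p <- exp(t) / sum(exp(t))                  # softmax link: class probabilities
p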

σ Activation function

σ is an "activation function" designed to mimic the behavior of neurons in propagating signals in the (human) brain. The activation function makes the model nonlinear (in the parameters).

- sigmoid: σ(x) = 1 / (1 + e^{-x})
- ReLU: σ(x) = max(0, x)
- ReLU stands for "rectified linear unit"
- ReLU trains faster than the sigmoid and may perform better (both functions are sketched below)
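
The two activation functions sketched in base R; the helper names and the comparison plot are illustrative.

sigmoid <- function(x) 1 / (1 + exp(-x))   # squashes input to (0, 1)
relu    <- function(x) pmax(0, x)          # "rectified linear unit"

curve(sigmoid, -4, 4, ylab = "activation")
curve(relu, -4, 4, add = TRUE, lty = 2)
legend("topleft", c("sigmoid", "ReLU"), lty = 1:2)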

s Link function

The link function transforms the output nodes so that they make sense for the type of problem: classification or regression.

- regression: identity link, s(t) = t
- classification: softmax link,

  s(t_k) = e^{t_k} / Σ_{l=1}^{K} e^{t_l}

- s(t_k) is a value between 0 and 1
- s(t_k) is P(y_k = 1 | x), where y_k is the target code for class k (see the sketch below)
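
A base-R sketch of the softmax link applied to the vector of output nodes t_1, ..., t_K; subtracting max(t) is a standard numerical-stability trick, not part of the slide formula.

softmax <- function(t) {
  e <- exp(t - max(t))     # stabilize before exponentiating
  e / sum(e)               # s(t_k) for k = 1, ..., K; sums to 1
}
softmax(c(2.0, 1.0, -0.5))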

Training (fitting) neural networks

- NNs often have a large number of parameters, denoted θ
- fit NNs by minimizing the average loss in the training data
- regression: err = Σ_{i=1}^{N} (y_i − f(x_i, θ))^2
- classification: err = −Σ_{i=1}^{N} Σ_{k=1}^{K} y_{ik} log f_k(x_i, θ), where y_{ik} is the indicator for class k and f_k gives the probability for class k; this is called the "cross-entropy" or "deviance"
- other loss functions can be used (the two above are sketched in R below)
- err is minimized with respect to θ using a gradient descent algorithm called "back-propagation" or "backprop"
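
Base-R sketches of the two loss functions above; the helper names and the tiny example data are illustrative.

## squared-error loss for regression
sq_err <- function(y, yhat) sum((y - yhat)^2)

## cross-entropy (deviance) for classification:
## Y is the N x K one-hot indicator matrix y_ik, probs the N x K matrix f_k(x_i, θ)
cross_entropy <- function(Y, probs) -sum(Y * log(probs))

Y     <- rbind(c(1, 0, 0), c(0, 0, 1))             # two observations, three classes
probs <- rbind(c(0.7, 0.2, 0.1), c(0.2, 0.3, 0.5))
cross_entropy(Y, probs)                            # = -(log(0.7) + log(0.5))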

Training (fitting) neural networks

- backprop iterates two steps (sketched in R below):
- forward step: fix θ and compute the values of the hidden and output nodes z_m and t_k, then apply the link function to get predictions f(x_i) (probabilities for classification, or a numerical prediction for regression)
- backward step: fix z_m, t_k, and f(x_i), and update θ using a gradient descent step
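
A minimal sketch of these two steps for a one-hidden-layer regression NN (sigmoid hidden units, identity link, squared-error loss) on a toy dataset; the data, sizes, learning rate, and iteration count are illustrative assumptions.

set.seed(1)
N <- 200; P <- 2; M <- 4
X <- matrix(rnorm(N * P), N)
y <- sin(X[, 1]) + 0.5 * X[, 2] + rnorm(N, sd = 0.1)

sigmoid <- function(u) 1 / (1 + exp(-u))
A  <- matrix(rnorm(M * (P + 1), sd = 0.5), M)   # hidden weights (col 1 = bias)
b  <- rnorm(M + 1, sd = 0.5)                    # output weights (first = bias)
lr <- 0.01                                      # learning rate

for (iter in 1:2000) {
  i  <- sample(N, 1)                 # one observation per round (stochastic)
  x1 <- c(1, X[i, ])
  ## forward step: hidden nodes, then prediction
  z    <- sigmoid(drop(A %*% x1))
  yhat <- sum(b * c(1, z))
  ## backward step: chain-rule gradients of (y - yhat)^2, then a descent step
  d      <- -2 * (y[i] - yhat)
  grad_b <- d * c(1, z)
  grad_A <- (d * b[-1] * z * (1 - z)) %*% t(x1)
  b <- b - lr * grad_b
  A <- A - lr * grad_A
}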

Training (fitting) neural networks

- we usually do not want the global minimum of err, due to overfitting
- the number of backprop iterations, the learning rate (of the gradient descent algorithm), and shrinkage penalties (weight decay, dropout) are tuning parameters

Training (fitting) neural networks

Shrinkage:

- weight decay: modified objective

  err(θ) + λ J(θ),   where   J(θ) = Σ_k θ_k^2

  like a ridge penalty; λ is a tuning parameter

- dropout: at each round of training, some of the hidden or input nodes (z_m or x_p) are ignored (assigned the value zero); the ignored nodes are selected at random at each round of backprop; the number of ignored nodes is a tuning parameter (both penalties are sketched below)
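
Base-R sketches of the two shrinkage devices; the function names and the default dropout rate are illustrative, not tied to any particular package.

## weight decay: add a ridge-like penalty λ * Σ θ_k^2 to the training loss
penalized_err <- function(err, theta, lambda) err + lambda * sum(theta^2)

## dropout: zero out a random subset of node values for one round of backprop;
## rate (the fraction of nodes dropped) is the tuning parameter
dropout <- function(z, rate = 0.5) {
  keep <- rbinom(length(z), 1, prob = 1 - rate)
  z * keep
}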

Simple NN in R: nnet.R
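
The course script nnet.R is not reproduced here; the following is a stand-in showing the same kind of fit with the nnet package (the iris data, hidden-layer size, decay, and maxit values are illustrative choices, not the script's).

library(nnet)
set.seed(1)
train <- sample(nrow(iris), 100)
fit   <- nnet(Species ~ ., data = iris[train, ], size = 5,
              decay = 1e-3, maxit = 200, trace = FALSE)   # decay = weight decay λ
pred  <- predict(fit, iris[-train, ], type = "class")
table(predicted = pred, observed = iris$Species[-train])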

Extending NN’s

- the real power of NNs comes through various extensions:
- additional hidden layers
- modifying the connectivity between layers
- processing between layers

Additional layers

More than one hidden layer:

Modified connectivity

- local connectivity: hidden units receive input from only a subset of "local" units in the layer below; not "fully" connected
- say there are 3 hidden nodes and 9 features (see the R sketch after this list):

  z_1 = σ(α_{01} + α_{11} x_1 + α_{21} x_2 + α_{31} x_3)
  z_2 = σ(α_{02} + α_{12} x_4 + α_{22} x_5 + α_{32} x_6)
  z_3 = σ(α_{03} + α_{13} x_7 + α_{23} x_8 + α_{33} x_9)

- each hidden node is linked to just 3 of the 9 features
- the hidden layer has just 3 × 4 = 12 parameters instead of 30
- the output layer is typically fully connected
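
A base-R sketch of this locally connected layer: each of the 3 hidden nodes sees only its own block of 3 of the 9 features (the weights and features are random placeholders).

set.seed(1)
x       <- rnorm(9)
alpha   <- matrix(rnorm(3 * 4), nrow = 3)   # 3 nodes x (bias + 3 local weights)
sigmoid <- function(u) 1 / (1 + exp(-u))

blocks <- list(1:3, 4:6, 7:9)               # which features feed each hidden node
z <- sapply(1:3, function(m)
  sigmoid(alpha[m, 1] + sum(alpha[m, -1] * x[blocks[[m]]])))
z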

Modified connectivity

Local connectivity

Modified connectivity

- local connectivity: there can be some "overlap"; some hidden nodes can take input from some of the same features
- say there are 3 hidden nodes and 9 features:

  z_1 = σ(α_{01} + α_{11} x_1 + α_{21} x_2 + α_{31} x_3)
  z_2 = σ(α_{02} + α_{12} x_3 + α_{22} x_4 + α_{32} x_5 + α_{42} x_6)
  z_3 = σ(α_{03} + α_{13} x_6 + α_{23} x_7 + α_{33} x_8 + α_{43} x_9)

Modified connectivity

Local connectivity

Modified connectivity

- weight sharing: some hidden units share weights
- this only makes sense in combination with local connectivity
- say there are 3 hidden nodes and 9 features (see the R sketch after this list):

  z_1 = σ(α_{01} + α_1 x_1 + α_2 x_2 + α_3 x_3)
  z_2 = σ(α_{02} + α_1 x_4 + α_2 x_5 + α_3 x_6)
  z_3 = σ(α_{03} + α_1 x_7 + α_2 x_8 + α_3 x_9)

- typically each hidden node retains a distinct bias (α_{0m})
- the hidden layer has just 3 + 3 = 6 parameters
- the number of links ≠ the number of parameters (there are more links)
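
A base-R sketch of the shared-weight layer: one weight vector of length 3 is reused by all 3 hidden nodes (in effect a 1-D convolution with kernel size 3 and stride 3), with a distinct bias per node; the values are random placeholders.

set.seed(1)
x       <- rnorm(9)
alpha   <- rnorm(3)                          # shared weights α_1, α_2, α_3
bias    <- rnorm(3)                          # distinct biases α_{0m}
sigmoid <- function(u) 1 / (1 + exp(-u))

z <- sapply(1:3, function(m) {
  idx <- (3 * (m - 1) + 1):(3 * m)           # x_1..x_3, x_4..x_6, x_7..x_9
  sigmoid(bias[m] + sum(alpha * x[idx]))
})
z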

Modified connectivity

Local connectivity + weight sharing

Example: zipcode data

- hand-written digits
- output: 10-class classification
- input: 16 × 16 B&W image

Example: zipcode data

- input: 16 × 16 B&W image
- when the input is an array (2-D in this case), the hidden units in a hidden layer are typically represented as an array too
- the figure below shows local connectivity with a 5 × 5 "kernel"

Local connectivity

- 3 × 3 kernel
- stride 1
- edge handling

Local conn. w/weight sharing (convolution)

A convolution is a type of shape detector or "feature map":
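
A base-R sketch of one convolution ("feature map") over a 16 × 16 image with a 3 × 3 kernel, stride 1, and no padding, so the map is 14 × 14; the image, kernel values, and function name are illustrative.

set.seed(1)
img    <- matrix(runif(16 * 16), 16, 16)
kernel <- matrix(c(-1, -1, -1, 0, 0, 0, 1, 1, 1), 3, 3)  # simple edge-detecting kernel

conv2d <- function(img, kernel) {
  k   <- nrow(kernel)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + k - 1), j:(j + k - 1)] * kernel)
  out
}
fmap <- conv2d(img, kernel)   # one feature map
dim(fmap)                     # 14 x 14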

Example: zipcode data

- Net 1: no hidden layer, the same as multinomial logistic regression; (256 + 1) × 10 = 2570 parameters
- Net 2: one hidden layer, 12 units, fully connected; (256 + 1) × 12 + (12 + 1) × 10 = 3214 parameters
- Net 3: two hidden layers, locally connected; 1226 parameters
- Net 4: two hidden layers, locally connected, weight sharing; 1132 parameters, 2266 links

Local connectivity

Local connectivity and shared weights

AKA: convolutional neural networks

- groups of hidden units form "shape detectors" or "feature maps"
- more complex shape detectors appear near the output layer

Net 3

Net 4

Performance on zipcode data

Processing between layers

- pooling/subsampling: down-sample the data from a layer by summarizing groups of units
- max-pooling: summarize each group using its maximum (sketched below):
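
A base-R sketch of 2 × 2 max-pooling with stride 2: each non-overlapping 2 × 2 block of a feature map is summarized by its maximum (the function name and example input are illustrative).

max_pool <- function(fmap, size = 2) {
  rows <- seq(1, nrow(fmap) - size + 1, by = size)
  cols <- seq(1, ncol(fmap) - size + 1, by = size)
  out  <- matrix(0, length(rows), length(cols))
  for (i in seq_along(rows))
    for (j in seq_along(cols))
      out[i, j] <- max(fmap[rows[i]:(rows[i] + size - 1),
                            cols[j]:(cols[j] + size - 1)])
  out
}
max_pool(matrix(1:16, 4, 4))   # 4 x 4 input -> 2 x 2 output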

Deep learning and deep neural networks

- deep learning uses deep NNs
- deep NNs are simply NNs with many layers, complex connectivity, and processing steps between the layers:

Complex NNs in R

- there are no (good) native R libraries for complex NNs
- R can interface with good libraries, e.g., Keras and TensorFlow
- see https://keras.rstudio.com/ (a minimal example follows below)
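
A hedged sketch of a small convolutional NN for the 16 × 16 zipcode images using the keras R package; the layer sizes and optimizer are illustrative choices, and running it requires an installed Keras/TensorFlow backend.

library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 8, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(16, 16, 1)) %>%        # one grayscale channel
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 12, activation = "sigmoid") %>%
  layer_dense(units = 10, activation = "softmax")      # 10 digit classes

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "sgd", metrics = "accuracy")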
