
Statistical Natural Language Processing
Recurrent and convolutional networks

Çağrı Çöltekin

University of Tübingen
Seminar für Sprachwissenschaft

Summer Semester 2020


Deep neural networks

[Figure: a deep feed-forward network with inputs x1 … xm and several hidden layers]

• Deep neural networks (>2 hidden layers) have recently been successful in many tasks

• They often use sparse connectivity and shared weights

• We will focus on two important architectures: recurrent and convolutional networks


Why deep networks?

• We saw that a feed-forward network with a single hidden layer is a universal approximator

• However, this is a theoretical result – it is not clear how many units one may need for the approximation

• Successive layers may learn different representations
• Deeper architectures have been found to be useful in many tasks


Why now?

• Increased computational power, especially advances in graphics processing unit (GPU) hardware

• Availability of large amounts of data
  – mainly unlabeled data (more on this later)
  – but also labeled data through ‘crowd sourcing’ and other sources

• Some new developments in theory and applications


Recurrent neural networks

• Feed-forward networks
  – can only learn associations
  – do not have memory of earlier inputs: they cannot handle sequences

• Recurrent neural networks are the ANN solution for sequence learning
• This is achieved by recurrent loops in the network


Recurrent neural networks

[Figure: an RNN over the input sequence x1 … x4 with hidden states h1 … h4 and a single output y; each hidden state receives the current input as well as the previous hidden state]

• Recurrent neural networks are similar to the standard feed-forward networks
• They include loops that use the previous output (of the hidden layers) as well as the input
• Forward calculation is straightforward, but learning becomes somewhat tricky


A simple version: SRNs (Elman 1990)

[Figure: an SRN; the input units and the context units feed the hidden units, which feed the output units, and the hidden state is copied back to the context units after each step]

• The network keeps the previous hidden states (context units)
• The rest is just like a feed-forward network
• Training is simple, but the network cannot learn long-distance dependencies
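To make the recurrence concrete, here is a minimal sketch of an SRN forward pass in plain NumPy. It is an illustrative assumption, not code from the course: the function name, the dimensions and the tanh/softmax choices are all made up for the example.

import numpy as np

def srn_forward(xs, W_in, W_ctx, W_out, b_h, b_y):
    """Run an Elman SRN over a sequence of input vectors xs.

    h(t) = tanh(W_in x(t) + W_ctx h(t-1) + b_h)   # the context units hold h(t-1)
    y(t) = softmax(W_out h(t) + b_y)
    """
    h = np.zeros(W_ctx.shape[0])                  # initial context (all zeros)
    outputs = []
    for x in xs:
        h = np.tanh(W_in @ x + W_ctx @ h + b_h)   # new hidden state
        z = W_out @ h + b_y
        outputs.append(np.exp(z) / np.exp(z).sum())   # softmax output
    return outputs, h

# Toy usage: a sequence of four 3-dimensional inputs, hidden size 5, 2 classes.
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(4)]
W_in, W_ctx = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_out, b_h, b_y = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)
ys, h_final = srn_forward(xs, W_in, W_ctx, W_out, b_h, b_y)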


Processing sequences with RNNs

• RNNs process sequences one unit at a time
• The earlier inputs affect the output through recurrent links

[Figure: hidden states h1 … h4 consume the words of the input sentence ‘not really worth seeing’ one at a time; the final state produces the output y]


Learning in recurrent networks

[Figure: a recurrent unit with input x, hidden state h(1) and output y(1); W0 connects the input to the hidden state, W1 is the recurrent (hidden-to-hidden) connection, and W2 connects the hidden state to the output]

• We need to learn three sets of weights: W0, W1 and W2
• Backpropagation in RNNs is at first not that obvious
• The main difficulty is in propagating the error through the recurrent connections


Unrolling a recurrent network: backpropagation through time (BPTT)

[Figure: the network unrolled in time; each input x(0), x(1), …, x(t−1), x(t) feeds a hidden state h(0), h(1), …, h(t−1), h(t), which also receives the previous hidden state and produces an output y(0), y(1), …, y(t−1), y(t)]

Note: the corresponding weights are shared across all time steps (drawn with the same color in the original figure).
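The following is a minimal BPTT sketch for a many-to-one RNN with a squared-error loss, written in plain NumPy. The loss, the tanh activation and all names are illustrative assumptions; the point is that the same W_in and W_h appear at every time step of the unrolled network, so their gradients accumulate over time.

import numpy as np

def bptt(xs, target, W_in, W_h, W_out):
    # Forward pass, keeping every hidden state for the backward pass.
    hs = [np.zeros(W_h.shape[0])]
    for x in xs:
        hs.append(np.tanh(W_in @ x + W_h @ hs[-1]))
    y = W_out @ hs[-1]
    loss = 0.5 * np.sum((y - target) ** 2)

    # Backward pass: W_in and W_h are shared across time steps,
    # so their gradients accumulate over the unrolled network.
    dW_in, dW_h = np.zeros_like(W_in), np.zeros_like(W_h)
    dW_out = np.outer(y - target, hs[-1])
    dh = W_out.T @ (y - target)
    for t in reversed(range(len(xs))):
        dz = dh * (1.0 - hs[t + 1] ** 2)      # derivative of tanh
        dW_in += np.outer(dz, xs[t])
        dW_h += np.outer(dz, hs[t])
        dh = W_h.T @ dz                       # propagate the error to h(t-1)
    return loss, dW_in, dW_h, dW_out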


Unstable gradients

• A common problem in deep networks is unstable gradients
• The partial derivatives with respect to weights in the early layers are calculated using the chain rule
• A long chain of multiplications may result in
  – vanishing gradients if the values are in the range (−1, 1)
  – exploding gradients if the absolute values are larger than 1
• A practical solution for exploding gradients is called gradient clipping (see the sketch below)
• The solution to vanishing gradients is more involved (coming soon)
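A minimal sketch of gradient clipping by global norm, one common way of guarding against exploding gradients. The threshold and the helper name are illustrative assumptions, not the only way to clip.

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Usage with the BPTT gradients from the sketch above:
# dW_in, dW_h, dW_out = clip_gradients([dW_in, dW_h, dW_out])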


RNN architectures: many-to-many (e.g., POS tagging)

[Figure: every input x(0), x(1), …, x(t) passes through a hidden state and produces a corresponding output y(0), y(1), …, y(t)]


RNN architectures: many-to-one (e.g., document classification)

[Figure: the inputs x(0), x(1), …, x(t) are processed through the hidden states h(0), h(1), …, h(t), and only the last state produces an output y(t)]


RNN architectures: many-to-many with a delay (e.g., machine translation)

[Figure: the inputs x(0), x(1), …, x(t) are read first; the outputs y(t−1), y(t), … are produced only after a delay]


Bidirectional RNNs

[Figure: a bidirectional RNN; forward states run left to right and backward states run right to left over the inputs x(t−1), x(t), x(t+1), and both contribute to the outputs y(t−1), y(t), y(t+1)]
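A minimal sketch of what the figure shows: one RNN reads the sequence left to right, another reads it right to left, and the two hidden states at each position are concatenated. Function names, dimensions and the tanh recurrence are illustrative assumptions.

import numpy as np

def rnn_states(xs, W_in, W_h):
    """Return the hidden state at every position of the sequence xs."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return states

def birnn_states(xs, forward_weights, backward_weights):
    fwd = rnn_states(xs, *forward_weights)
    bwd = rnn_states(xs[::-1], *backward_weights)[::-1]
    # Each position now sees both its left (forward) and right (backward) context.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]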


Unstable gradients revisited

• We noted earlier that the gradients may vanish or explode during backpropagation in deep networks
• This is especially problematic for RNNs, since the effective depth of the network can be extremely large
• Although RNNs can theoretically learn long-distance dependencies, this is affected by the unstable gradients problem
• The most popular solution is to use gated recurrent networks


Gated recurrent networks

[Figure: an LSTM cell; the input x(t) and the previous hidden state h(t−1) feed a forget gate (σf), an input gate (σi), a tanh candidate layer and an output gate (σo), which update the cell state c(t−1) → c(t) and produce the new hidden state h(t)]

• Most modern RNN architectures are ‘gated’
• The main idea is learning a mask that controls what to remember (or forget) from the previous hidden layers
• Two popular architectures are
  – Long short-term memory (LSTM) networks (above, sketched in code below)
  – Gated recurrent units (GRU)
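A minimal sketch of a single LSTM step with the standard forget/input/output gates and a tanh candidate layer. The dictionary-of-weights layout and the helper names are illustrative assumptions, not the only way to write it.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the weights of the four gates f, i, o, g."""
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell state
    c = f * c_prev + i * g          # forget part of the old state, add new content
    h = o * np.tanh(c)              # new hidden state
    return h, c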


Convolutional networks

• Convolutional networks are particularly popular in image processing applications
• They have also been used with success in some NLP tasks
• Unlike the feed-forward networks we have discussed so far,
  – CNNs are not fully connected
  – the hidden layer(s) receive input from only a set of neighboring units
  – some weights are shared
• A CNN learns features that are location invariant
• CNNs are also computationally less expensive compared to fully connected networks


Convolution in image processing

• Convolution is a common operation in image processing for effects like edge detection, blurring, sharpening, …
• The idea is to transform each pixel with a function of the local neighborhood

[Figure: a filter W slides over the input X to produce the output Y]

y = ∑_i w_i x_i
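A minimal sketch of the operation above for a 2D input: the filter W is applied at every position of X, and each output value is y = ∑_i w_i x_i over the local window (strictly speaking this is cross-correlation, which is what neural networks usually compute). No padding, stride 1; all names are illustrative assumptions.

import numpy as np

def convolve2d(X, W):
    """Apply filter W to every position of image X (no padding, stride 1)."""
    fh, fw = W.shape
    oh, ow = X.shape[0] - fh + 1, X.shape[1] - fw + 1
    Y = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            Y[r, c] = np.sum(X[r:r + fh, c:c + fw] * W)   # weighted sum over the window
    return Y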


Example convolutions

• Blurring:

      1/16 ×   1 2 1
               2 4 2
               1 2 1

• Edge detection:

      −1 −1 −1
      −1  8 −1
      −1 −1 −1
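As an illustration, the two example filters can be applied with the convolve2d sketch from above; the input image here is random, purely for demonstration.

import numpy as np

blur = np.array([[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]]) / 16.0
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])

image = np.random.default_rng(0).uniform(size=(8, 8))   # made-up 8x8 image
blurred = convolve2d(image, blur)   # smooths local neighborhoods
edges = convolve2d(image, edge)     # responds strongly at sharp transitions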


Learning convolutions

• Some filters produce features that are useful for classification (e.g., of images or sentences)
• In machine learning we want to learn the convolutions
• Typically, we learn multiple convolutions, each resulting in a different feature map
• Repeated application of convolutions allows learning higher-level features
• The last layer is typically a standard fully-connected classifier


Convolution in neural networks

[Figure: a 1D convolution over the inputs x1 … x5; each hidden unit h2, h3, h4 is connected to a window of three neighboring inputs through the same shared weights w−1, w0, w1]

• Each hidden unit corresponds to a local window in the input
• Weights are shared: each convolution detects the same type of features
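A minimal sketch of the 1D convolution in the figure: the same shared weights are applied to every window of neighboring inputs. The window size, the weights and the input values are illustrative assumptions.

import numpy as np

def conv1d(xs, w):
    """Slide the shared filter w over the sequence xs (no padding, stride 1)."""
    k = len(w)
    return np.array([np.dot(w, xs[i:i + k]) for i in range(len(xs) - k + 1)])

xs = np.array([1.0, 3.0, 2.0, 5.0, 2.0])   # example input sequence
w = np.array([0.5, 1.0, 0.5])              # shared weights (w−1, w0, w1)
hidden = conv1d(xs, w)                     # one hidden value per local window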


Pooling

[Figure: a convolution layer maps the inputs x1 … x5 to h1 … h5, and a pooling layer summarizes neighboring convolution outputs into h′1, h′2, h′3]

• Convolution is combined with pooling
• Pooling ‘layer’ simply calculates a statistic (e.g., max) over the convolution layer
• Location invariance comes from pooling
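A minimal sketch of a max-pooling ‘layer’ applied to the 1D convolution output from the sketch above; the window size and stride are illustrative assumptions.

import numpy as np

def max_pool1d(hs, size=3, stride=1):
    """Take the maximum over each window of the convolution outputs hs."""
    return np.array([hs[i:i + size].max()
                     for i in range(0, len(hs) - size + 1, stride)])

pooled = max_pool1d(conv1d(xs, w))   # a statistic (max) over neighboring conv outputs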


Pooling and location invariance

Convolution output:  1 3 2 5 2   →  Max pooling:  3 5 5
Convolution output:  2 1 3 2 5   →  Max pooling:  3 3 5
Convolution output:  4 2 1 3 2   →  Max pooling:  4 3 3

• Note that the numbers at the pooling layer are stable in comparison to the convolution layer
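This stability can be checked directly with the max_pool1d sketch above, using the numbers from the slide: as the pattern shifts, the convolution outputs shift with it, but the pooled values change much less.

import numpy as np

for conv_out in ([1, 3, 2, 5, 2], [2, 1, 3, 2, 5], [4, 2, 1, 3, 2]):
    print(max_pool1d(np.array(conv_out, dtype=float)))
# Prints [3. 5. 5.], [3. 3. 5.] and [4. 3. 3.]: the pooled values are more stable.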


Padding in CNNs

• With successive layers of convolution and pooling, the later layers shrink
• One way to avoid this is padding the input and hidden layers with a sufficient number of zeros (see the sketch below)
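A minimal sketch of zero padding for the 1D case, reusing the conv1d example above: padding the input keeps the output from shrinking.

import numpy as np

padded = np.pad(xs, pad_width=1)     # zero-pad both ends: [0, 1, 3, 2, 5, 2, 0]
same_size = conv1d(padded, w)        # output now has the same length as xs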


CNNs: the bigger picture

• At each convolution/pooling step, we often want to learn multiple feature maps
• After a (long) chain of hierarchical feature maps, the final layer is typically a fully-connected layer (e.g., softmax for classification)

[Figure: a stack of convolution and pooling layers followed by a fully connected classifier that produces the output]


Real-world examples are complex

[Figure: the GoogLeNet architecture, a long chain of convolution, max/average pooling, depth-concatenation and fully connected layers with several softmax outputs]
Real-world ANNs tend to be complex:
• many layers (sometimes with repetition)
• a large amount of branching

* Diagram describes an image classification network, GoogLeNet (Szegedy et al. 2014).


CNNs in natural language processing

• The use of CNNs in image applications is rather intuitive
  – the first convolutional layer learns local features, e.g., edges
  – successive layers learn more complex features that are combinations of these features
• In NLP, it is a bit less straightforward
  – CNNs are typically used in combination with word vectors
  – the convolutions of different sizes correspond to (word) n-grams of different sizes
  – pooling picks important ‘n-grams’ as features for classification (a sketch follows below)
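A minimal sketch of this idea: filters of different widths act as (word) n-gram detectors over a matrix of word vectors, and max-over-time pooling keeps the strongest response of each filter. The tiny embedding table, the filter shapes and all names are illustrative assumptions, not the slides' implementation.

import numpy as np

rng = np.random.default_rng(0)
emb = {word: rng.normal(size=4) for word in "not really worth seeing".split()}

def ngram_features(words, filters):
    """One max-pooled feature per convolution filter (filter width = n-gram size)."""
    X = np.stack([emb[word] for word in words])      # (sequence length, embedding dim)
    feats = []
    for W in filters:                                # W has shape (n, embedding dim)
        n = W.shape[0]
        acts = [np.sum(W * X[i:i + n]) for i in range(len(words) - n + 1)]
        feats.append(max(acts))                      # max-over-time pooling
    return np.array(feats)

filters = [rng.normal(size=(2, 4)), rng.normal(size=(3, 4))]   # bigram and trigram filters
features = ngram_features("not really worth seeing".split(), filters)
# 'features' would then be fed to a standard classifier (e.g., a softmax layer).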


An example: sentiment analysis

[Figure: a sentiment analysis pipeline for the input ‘not really worth seeing’: words → word vectors → convolution → feature maps → pooling → features → classifier]


Summary

• Deep networks use more than one hidden layer
• Common (deep) ANN architectures include:
  CNN  shared feed-forward weights, location invariance
  RNN  sequence learning

Next:
• N-gram language models
