
A Practical Guide to Deep Learning at the Department of Mathematics

Vegard Antun (UiO)

March 19, 2019

1 / 61

Layout of the talk

Part I Computer resources, the Linux operating system, large-scale computations.

Part II Neural networks, mathematical framework, practical example.

2 / 61

Computer resources

[Diagram: a CPU with its cache, main memory, and hard drive.]

3 / 61

Memory hierarchies (figure adapted from INF1060, Pål Halvorsen, University of Oslo)

[Figure: the memory hierarchy spans cache(s), main memory, secondary storage (disks) and tertiary storage (tapes). Typical access times: 0.3 ns, 1 ns (on-die memory), 50 ns, 5 ms. On a human time scale the same numbers correspond to < 1 s, 2 s, 1.5 minutes, and 3.5 months.]

Computer resources

[Diagram: as above, but with a GPU and its own memory added alongside the CPU, cache, main memory, and hard drive.]

5 / 61

Time measurements

Total time for 10 epochs on CIFAR10. Batch size 10.

- CPU: 8 min, 35 sec
- GPU: 53 sec (≈ 10 times faster)

[Bar chart: seconds (0-20) needed to load 50 MR scans (each 40 MB) on nam-shub, from network storage, local disk and RAM.]

6 / 61

Operating systems (OS)

[Diagram: the operating system as a software layer on top of the hardware.]

7 / 61

The Linux Filesystem Hierarchy

The uppermost directory in the Linux file system is /

[ ∼ ]$ ls

Desktop Downloads Pictures www_docs

Documents pc WINDOWS

[ ∼ ]$ pwd

/mn/sarpanitu/ansatte-u4/vegarant

[ ∼ ]$ cd /

[ / ]$ ls

admin etc lib misc opt sbin tf usit

bin hf lib64 mn proc site tmp usr

boot home local mnt rh srv ub uv

dev ifi med net root sv uio var

div jus media odont run sys use

8 / 61

Some important directories

- /bin Most basic executable files (e.g. ls, cp)
- /lib Libraries used by the executables
- /boot Files related to the boot loader
- /dev All devices, /dev/random, /dev/null, /dev/pts/0
- /etc Configuration files, /etc/hostname, /etc/passwd
- /home/username Your home folder ∼/ (not on the UiO system)
- /root Home directory of the root user
- /tmp Temporary files - not preserved across reboots
- /usr Read-only user data. Multi-user applications
- /var Variable files, i.e. files which change during execution

9 / 61

Environment variables

A variable with a name and a value, used by one or more applications. To view them all, type env

Some important environment variables

- PATH All directories where we search for executables
- PYTHONPATH All directories where we search for Python modules
- HOME Your home directory, i.e. the position of ∼/
- EDITOR Default editor
- TF_CPP_MIN_LOG_LEVEL Level of verbosity for TensorFlow

11 / 61

Environment variables - Example

[ ∼ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2
[ ∼ ]$
[ ∼ ]$ export PYTHONPATH=$PYTHONPATH:/path/to/new_module
[ ∼ ]$
[ ∼ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2:/path/to/new_module

12 / 61

The ∼/.bashrc

The scripting language you type in the terminal is called "bash" (Bourne Again SHell).

We often want the environment to stay persistent between logins. Set defaults in the following files:

- ∼/.bashrc Run each time you open a terminal on your computer

[ ∼ ]$ cat ∼/.bashrc
export PYTHONPATH=$PYTHONPATH:/path/to/new_module
export TF_CPP_MIN_LOG_LEVEL=1
alias la='ls -a --color=auto'
alias ll='ls -lh --color=auto'
# Describes the command line prompt
PS1='[ \h \w ]$ '

13 / 61

The ∼/.bashrc and ∼/.bash_profile files

- ∼/.bashrc Run each time you open a terminal on your computer
- ∼/.bash_profile Run each time you log in remotely.

Having two different settings in ∼/.bashrc and ∼/.bash_profile is often inconvenient. To only use the ∼/.bashrc file, place the following lines in your ∼/.bash_profile

[ ∼ ]$ cat .bash_profile
if [ -f ∼/.bashrc ]; then
    . ∼/.bashrc
fi

Note: Files starting with '.' are not shown when you type ls. In order to see these files, type ls -a

14 / 61

Login to remote machines via SSH

Log in to the university's network from a personal Linux or Mac computer

[ ∼ ]$ ssh -X username@login.math.uio.no

The -X option enables X11 forwarding, i.e. you can open GUI-based applications.

Once you are logged in, you can continue to the desired computer by typing

[ ∼ ]$ ssh -X computername
[ ∼ ]$ # Example, logging into the hadad computer
[ ∼ ]$ ssh -X hadad

15 / 61

Login to remote machines via SSH

Next we will see how to make this procedure require less typing!

16 / 61

SSH config file

Create the file ∼/.ssh/config and add the following lines

host uio

hostname login.math.uio.no

user your_username

ForwardX11 no

You can then log on to the university's network by typing

ssh -X uio

We assume you have this setup in the rest of this presentation.

17 / 61

SSH keys

To keep UiO passwords secure they often require a lot of typing. SSH keys provide an easy way to maintain high security while allowing shorter passwords.

18 / 61

Generate and set up SSH-key

[ ∼ ]$ ssh-keygen -t rsa -b 4096 -C "your@email.com"

This command will create two files

- ∼/.ssh/id_rsa Private key. Do not share it.
- ∼/.ssh/id_rsa.pub Public key. Can be shared with anyone.

Copy the public key to the remote host (UiO)

ssh-copy-id -i ∼/.ssh/id_rsa.pub <username>@login.math.uio.no

19 / 61

SSH and jump connections

Your computer → login.math.uio.no → math computer

- A jump connection sends the SSH traffic directly through a computer, like a regular router
- You avoid some typing, and you do not allocate a terminal on the jump computer
- Only allows for one jump

20 / 61

SSH and jump connections

To use a jump connection, add the following to your ∼/.ssh/config

# Setup for the math computers, this example uses belet-ili
Host belet-ili1
    Hostname belet-ili.uio.no
    ProxyJump vegarant@login.math.uio.no
    User vegarant

or you can add the jump connection directly on the command line

ssh -J <username>@login.math.uio.no <username>@<hostname>.uio.no

21 / 61

Terminal window managers

- Common choices are tmux or screen; they keep your terminal session (and any job running in it) alive after you disconnect.
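A minimal tmux workflow could look like this (a sketch; the session name "train" is just an example):

[ ∼ ]$ tmux new -s train        # start a named session on the remote machine
[ ∼ ]$ nice -n 19 python3 my_python_script.py   # start the job inside the session
# detach with Ctrl-b d; the job keeps running after you log out
[ ∼ ]$ tmux ls                  # list running sessions
[ ∼ ]$ tmux attach -t train     # re-attach later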

22 / 61

Monitor CPU usage

- Use the htop command to view CPU usage and process priority

23 / 61

Reducing the priority of your process

- Linux processes have "niceness" values {−20, . . . , 19}, where a smaller value gives higher priority.
- Negative nice values can only be set by the root user/administrator.
- The default priority of any process you start will be 0, i.e. you will typically reduce the priority.

[ ∼ ]$ nice -n 19 python3 my_python_script.py &
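If the process is already running, its niceness can be raised afterwards with renice (a sketch; replace <pid> with the process ID shown by htop):

[ ∼ ]$ renice -n 19 -p <pid>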

24 / 61

Monitor GPU usage

- All of our GPUs are from Nvidia. To view their current usage, use nvidia-smi
- To run this command every 5 seconds, use the watch command

[ ∼ ]$ watch -n 5 nvidia-smi
[ ∼ ]$ # or use
[ ∼ ]$ nvidia-smi -l 5

25 / 61

GPU resources at Dep. of Mathematics

Name          GPU               CPU cores   Mem.     scratch
nam-shub-01   4 × RTX 2080 Ti   28          128 GB   30 GB
zadkiel       1 × RTX 2080      4           16 GB    −
belet-ili     1 × GTX 1080      4           16 GB    −
cleopatra     1 × GTX 1080      4           16 GB    −
euphrosyne    1 × GTX 1080      4           16 GB    −
hadad         1 × GTX 1080      4           16 GB    −

26 / 61

AI HUB

- An experimental service for machine learning provided by USIT, to gain experience with hardware and software for deep learning.
- Reserved for students on weekdays (Mon-Fri) from 09:00 to 17:00.
- Need to log in via Abel (add SSH keys as before).

Name   GPU               CPU cores   Mem.     Non-persistent scratch
ml1    4 × RTX 2080 Ti   28          128 GB   17 TB
ml2    4 × RTX 2080 Ti   28          128 GB   17 TB
ml3    4 × RTX 2080 Ti   28          128 GB   17 TB

- AI mailing list: itf-ai-announcements@usit.uio.no

28 / 61

Deep learning frameworks

- Many older frameworks exist, such as MatConvNet, Caffe, Theano, ...
- For most scientists, TensorFlow (and maybe PyTorch) would be the preferred option.

29 / 61

Tensorflow

- Developed by Google, and has a large community.
- Relatively well documented.
- Has APIs in Python, JavaScript, C++, Java, Go, Swift.
- Models can be deployed into applications, such as websites and phones.

30 / 61

How to run Tensorflow?

- No unified way to do this on all systems.
- The machines ml1, ml2 and ml3 have TensorFlow v1.12 and PyTorch v1.0. Just type python3 to get started.
- On the math computers we use the module system (and maybe Singularity)

module avail                       # See which modules are available
module load tensorflow/<version>   # Load tensorflow
module rm tensorflow/<version>     # Unload tensorflow
module list                        # View loaded modules

- ML software is located under python-ml/<version> and tensorflow/<version>. Do not load both.
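A typical session on a math computer could then look like this (a sketch; the version string is only a placeholder, use one listed by module avail):

[ ∼ ]$ module load tensorflow/<version>
[ ∼ ]$ python3 -c 'import tensorflow as tf; print(tf.__version__)'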

31 / 61

Singularity

- Singularity (similar to Docker) is a container with a minimal operating system.
- It shares the kernel with the host operating system, so the CPU overhead is almost none.
- You can install whatever software you like within the container, with the necessary libraries.
- Makes reproducible research much easier!
- Check out Tormod Landet's excellent guide to Singularity: http://folk.uio.no/tormodla/singularity/
- On the math computers, precompiled Singularity images are located at /mn/sarpanitu/singularity/images/Machine_learning
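A minimal sketch of running a script inside one of these images (the image file name is only a placeholder; --nv makes the host's Nvidia GPUs visible inside the container):

[ ∼ ]$ singularity exec --nv /mn/sarpanitu/singularity/images/Machine_learning/<image> \
      python3 my_python_script.py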

32 / 61

Neat commands

- ag or ack – Search for a pattern in each source file in the tree from the current directory and downwards.
- fzf – Fuzzy finder. Search for filenames in the tree from the current directory and downwards.
- which <command> – E.g. which python gives the location of the program python.
- nohup nice -n 19 python -u my_script.py > output.txt & – Start a process which isn't shut down when you exit the login shell.

33 / 61

File permissions

On UNIX systems, access can be given to the user, the group, or all. The three types of permissions are read, write and execute.

[ ∼/some/directory]$ ls -l
drwxrwxr-x. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c

Reading the first line from left to right: d (directory), rwx (user permissions), rwx (group permissions), r-x (permissions for all), vegarant (username), vegarant (group name), 4096 (size), Oct 26 10:53 (last modified), my_dir (name).

[ ∼/some/directory]$ # Make directory private
[ ∼/some/directory]$ chmod 700 my_dir
[ ∼/some/directory]$ ls -l
drwx------. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c
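The same change can be expressed with chmod's symbolic mode, which some find easier to read (a sketch of equivalent commands):

[ ∼/some/directory]$ chmod u=rwx,go= my_dir    # same effect as chmod 700 my_dir
[ ∼/some/directory]$ chmod go-rwx my_dir       # alternatively: remove group/other access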

34 / 61

Part II

Neural networks, mathematical framework, practical example.

35 / 61

Neural Network

Definition 1
Let NN_{N,L,d} with N = (c = N_{L+1}, N_L, . . . , N_2, N_1 = d) denote the set of all L-layer neural networks, that is, all mappings f : R^d → R^c of the form

f(x) = W_L(· · · ρ(W_2(ρ(W_1(x)))) · · ·),   x ∈ R^d,

where W_j z = A_j z + b_j with A_j ∈ R^(N_{j+1} × N_j), b_j ∈ R^(N_{j+1}), and ρ : R → R is a non-linear function that acts elementwise on a vector.
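A minimal NumPy sketch of Definition 1, with ρ chosen as the ReLU and randomly drawn weights (the layer sizes are just an example):

import numpy as np

def relu(z):                      # one possible choice of rho
    return np.maximum(0, z)

# An L = 3 layer example with N = (c = N_4, N_3, N_2, N_1 = d) = (2, 8, 8, 4)
sizes = [4, 8, 8, 2]              # [N_1, N_2, N_3, N_4]
rng = np.random.RandomState(0)
A = [rng.randn(sizes[j + 1], sizes[j]) for j in range(3)]  # A_j in R^(N_{j+1} x N_j)
b = [rng.randn(sizes[j + 1]) for j in range(3)]            # b_j in R^(N_{j+1})

def f(x):
    # f(x) = W_3(rho(W_2(rho(W_1(x))))), with W_j z = A_j z + b_j
    z = x
    for j in range(2):
        z = relu(A[j] @ z + b[j])
    return A[2] @ z + b[2]        # no non-linearity after the last layer

x = rng.randn(4)
print(f(x))                       # a vector in R^2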

37 / 61

Choices of ρ

ρ : R → R acts elementwise on a vector.

Sigmoid: ρ(x) = 1/(1 + e^(−x))
ReLU: ρ(x) = max(0, x)
tanh: ρ(x) = tanh(x)
Leaky ReLU: ρ(x) = x for x ≥ 0, and ρ(x) = αx for x < 0

38 / 61

Choices of ρ

Max pooling: ρ(x_1, . . . , x_N) = (max{x_1, x_2}, . . . , max{x_{N−1}, x_N})

Average pooling (a linear map): ρ(x_1, . . . , x_N) = ((x_1 + x_2)/2, . . . , (x_{N−1} + x_N)/2)
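A small NumPy sketch of the two pooling operations above, applied to non-overlapping pairs (i.e. window size 2 with stride 2; other window sizes and strides are common):

import numpy as np

x = np.array([3., 1., 4., 1., 5., 9.])
pairs = x.reshape(-1, 2)          # group (x1, x2), (x3, x4), ...
print(pairs.max(axis=1))          # max pooling:     [3.  4.  9.]
print(pairs.mean(axis=1))         # average pooling: [2.  2.5 7.]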

39 / 61

Neural Network (Alternative definition)

Directed acyclic graph

x
z_1 = A_1 x + b_1
z_2 = ρ_1(z_1)
z_3 = A_2 z_2 + b_2
z_4 = A_3 x + b_3
z_5 = ρ_2(z_4)
z_6 = z_3 + z_5
z_7 = ρ_3(z_6)
Output

40 / 61

What is machine learning?

41 / 61

Machine learning model

- Training set: S = (z_1, . . . , z_m) ⊂ Z, where each z_i is i.i.d. from an unknown probability distribution D over Z ⊂ R^d.
- Function class: F, a class of functions/hypotheses.
- Cost function: C : F × Z → R.
- Risk: R_D(f) := E_{z∼D} C(f, z), where z ∼ D is independent of S.
- Goal: Find a "good hypothesis" f̂ ∈ F based on S such that R_D(f̂) is small.

Shalev-Shwartz & Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.

42 / 61

Examples

Binary classification
- Training set: {(x_i, y_i)}_{i=1}^m ⊂ R^d × {0, 1}.
- Function class: F can be a set of linear classifiers, neural networks, decision trees.
- Cost function: C(f, (x_i, y_i)) = 1_{y_i ≠ f(x_i)}.

Linear regression
- Training set: {(x_i, y_i)}_{i=1}^m ⊂ R^d × R.
- Function class: F = {⟨·, θ⟩ : θ ∈ R^(d+1)}.
- Cost function: C(f, (x_i, y_i)) = (y_i − ⟨[x_i, 1], θ⟩)^2.

Clustering
- Training set: S = {z_i}_{i=1}^m ⊂ R^d.
- Function class: F = {T = {T_1, . . . , T_k} : partition of S with centers (c_1, . . . , c_k)}.
- Cost function: C(T, z_i) = ||z_i − c_j|| for z_i ∈ T_j.
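A small NumPy sketch of the first two cost functions, evaluated on a single (purely illustrative) example:

import numpy as np

x_i, y_i = np.array([0.5, -1.2]), 1.0

# Binary classification: 0-1 loss for a toy classifier f
f = lambda x: float(x[0] > 0)
cost_01 = float(y_i != f(x_i))                 # 1 if misclassified, else 0

# Linear regression: squared loss with theta in R^(d+1), last entry the intercept
theta = np.array([2.0, -1.0, 0.5])
pred = np.concatenate([x_i, [1.0]]) @ theta    # <[x_i, 1], theta>
cost_sq = (y_i - pred) ** 2

print(cost_01, cost_sq)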

43 / 61

Machine learning model

- Risk: R_D(f) := E_{z∼D} C(f, z), where z ∼ D is independent of S.
- Goal: Find a "good hypothesis" f̂ ∈ F based on S such that R_D(f̂) is small. Notice: we cannot evaluate R_D(f), since D is unknown.

Empirical Risk Minimization

Approximate R_D(f) by

R_S(f) = (1/|S|) ∑_{z∈S} C(f, z)

We seek to find f^♯ ∈ argmin_{f∈F} R_S(f)

44 / 61

Bias-Complexity tradeoff

Let

ε_approx = min_{f∈F} R_D(f)   and   f^♯ ∈ argmin_{f∈F} R_S(f).

Then

R_D(f^♯) = ε_approx + (R_D(f^♯) − ε_approx),

where ε_approx is the approximation error and R_D(f^♯) − ε_approx is the estimation error.

45 / 61

Empirical Risk Minimization for Neural Networks

- Training set: {(x_i, y_i)}_{i=1}^n ⊂ R^d × R^c.
- Function class: F = NN_{N,L,d}, parametrized by the weights θ = (vec(A_1), b_1, . . . , vec(A_L), b_L), i.e. f(·, θ) : R^d → R^(N_{L+1}).
- Cost function: C(f, (x_i, y_i)) = d(f(x_i, θ), y_i), where d : R^c × R^c → R_+ is problem dependent.

1. θ ∈ R^p is often referred to as the weights.

2. Define the loss function

L(θ) = ∑_{i=1}^n d(f(x_i, θ), y_i)

3. Try to find

θ ∈ argmin_{θ∈R^p} L(θ)

using (stochastic) gradient descent.

46 / 61

Convex Optimization – Boyd & Vandenberghe

"Nonlinear optimization (or nonlinear programming) is the term used to describe an optimization problem when the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear programming problem (1.1). Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different approaches, each of which involves some compromise."

minimize   f_0(x),  x ∈ R^n
subject to f_i(x) ≤ b_i,  i = 1, . . . , m        (1.1)

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

47 / 61

Convex Optimization – Boyd & Vandenberghe

From the section on local optimization approaches to nonlinear optimization:

"Roughly speaking, local optimization methods are more art than technology. Local optimization is a well developed art, and often very effective, but it is nevertheless an art."

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

48 / 61

Gradient Descent for Neural Networks

- Recall that we want to minimize

L(θ) = ∑_{i=1}^n d(f(x_i, θ), y_i)

Gradient descent gives the iterations

θ_{k+1} = θ_k − α_k ∇L(θ_k)

for some step length α_k > 0.

- What happens to the computational cost if n is very large, say n ≈ 1 200 000?

49 / 61

Stochastic Gradient Descent for Neural Networks

- Create a partition {T_1, . . . , T_k} of the indices {1, . . . , n}, where each |T_j| ≤ s.
- Let

G_j(θ) = ∑_{i∈T_j} ∇_θ C(f(x_i, θ), y_i)

- Perform the updates

1: t = 0
2: for e = 1, . . . , M do
3:     for j = 1, . . . , k do
4:         θ_{t+1} = θ_t − α_t G_j(θ_t)
5:         t = t + 1
6: return θ_{kM}
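A minimal NumPy sketch of this update loop, for a least-squares loss on synthetic data (all sizes are illustrative; s is the batch size):

import numpy as np

rng = np.random.RandomState(0)
n, d, s, M, alpha = 200, 3, 10, 50, 0.01
X = rng.randn(n, d)
theta_true = np.array([4.0, -5.0, 3.0])
y = X @ theta_true

theta = np.zeros(d)
for e in range(M):                              # epochs
    idx = rng.permutation(n)
    for T_j in np.split(idx, n // s):           # partition of {1, ..., n}, |T_j| = s
        grad = 2 * X[T_j].T @ (X[T_j] @ theta - y[T_j])   # G_j(theta)
        theta = theta - alpha * grad
print(theta)                                    # approaches theta_true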

50 / 61

Alternative update rules

GD with momentum, 0 < γ < 1:

v_{t+1} = γ v_t + η G_j(θ_t)
θ_{t+1} = θ_t − v_{t+1}

Individual scaling of the different parameters (Adagrad, RMSprop, Adam):

θ_{t+1} = θ_t − D_t G_j(θ_t)

Here D_t is a diagonal matrix depending on some or all of the previously computed gradients.
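Continuing the SGD sketch a couple of slides back (same X, y, sizes and batches), the momentum update could be implemented as follows (γ and η are illustrative values):

gamma, eta = 0.9, 0.001
theta, v = np.zeros(d), np.zeros(d)
for e in range(M):
    idx = rng.permutation(n)
    for T_j in np.split(idx, n // s):
        grad = 2 * X[T_j].T @ (X[T_j] @ theta - y[T_j])   # G_j(theta_t)
        v = gamma * v + eta * grad        # v_{t+1} = gamma * v_t + eta * G_j(theta_t)
        theta = theta - v                 # theta_{t+1} = theta_t - v_{t+1}
print(theta)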

51 / 61

Tensorflow

import tensorflow as tf

import numpy as np

Most important tensors

- tf.Variable (must be initialized; can take gradients)
- tf.placeholder (input to the network)
- tf.constant (constant values)
- tf.Tensor (output of an operation)

Important Attributes

- shape (default is None, i.e. not specified)
- dtype (tf.float32, tf.int32, . . .)
- name (will be assigned a name if not specified)

52 / 61

[Graph: the inputs x and A produce z_1 = Ax; z_1 and b produce z_2 = z_1 + b.]

- A: tf.Variable
- x: tf.placeholder
- z_1: tf.Tensor
- b: tf.Variable, tf.placeholder or tf.constant
- z_2: tf.Tensor

53 / 61

Tensorflow

# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1,3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)
print(a)
print(b)

$ python3 program_name.py
<tf.Variable 'weights:0' shape=(1, 3) dtype=float32_ref>
<tf.Variable 'bias:0' shape=(1,) dtype=float32_ref>

54 / 61

Linear regression

# Code generating all the data
N = 50
a_true = np.array([[4., -5, 3]], dtype=np.float32)
b_true = np.array([2], dtype=np.float32)
x_data = np.concatenate((np.random.randn(1, N),
                         np.random.uniform(size=[1, N]),
                         np.random.chisquare(df=3.0, size=(1, N))))
noise = 0.01*np.random.randn(1, N)
labels = np.dot(a_true, x_data) + b_true  # + noise

a = [4, −5, 3]^T,  b = 2,  x_i ∈ R^3,  i = 1, . . . , N

x_i^T a + b = y_i,  i = 1, . . . , N

55 / 61

Tensorflow

# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1,3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)
X = tf.placeholder(dtype=tf.float32, name='data',
                   shape=[3, N])
prediction = tf.linalg.matmul(a, X) + b  # TF graph
print(X)
print(prediction)

$ python3 program_name.py
Tensor("data:0", shape=(3, 50), dtype=float32)
Tensor("add:0", shape=(1, 50), dtype=float32)

56 / 61

Tensorflow – Sessions

- Graphs only define the function you would like to compute.
- To execute a graph (function), open a tf.Session().

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    # All relevant placeholders go into the feed_dict
    pred = sess.run(prediction, feed_dict={X: x_data})
    a_start = sess.run(a)
    print(a_start)
    print(pred)  # pred is a numpy array with
                 # values = a*data + b

$ python3 program_name.py
[[-0.9025026   0.6354202  -0.09739944]]
[[-0.86136425  0.6985589   0.51153713  1.2961135
...
  0.91275173 -1.0157912  -0.41740212  0.45071918
  0.3727951  -0.81552047]]

57 / 61

Tensorflow – Gradient Descent

Y = tf.placeholder(dtype=tf.float32, name='label',
                   shape=[1, N])
# Compute sum_{i} (y[i] - prediction[i])^2
loss = tf.reduce_sum(tf.pow(prediction - Y, 2))
nbr_epochs = 100
step_length = 0.01  # often called learning rate
optimizer = tf.train.GradientDescentOptimizer(
    step_length).minimize(loss)

with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    for epoch in range(nbr_epochs):
        # Do a gradient descent step
        sess.run(optimizer, feed_dict={X: x_data,
                                       Y: labels})
    a_pred, b_pred = sess.run([a, b])

58 / 61

NeurIPS (earlier NIPS)

Submitted papers

- 2016: 2406 submissions
- 2017: 3240 submissions
- 2018: ∼4900 submissions

Source: Twitter

59 / 61

Recommended