
Neural Networks Compression Techniques for Succinct Classification Models

PRIN 17 Kick off meeting - 06.02.20

Department of Computer Science - UNIMI Unit

Unit members
• Marco Frasca, RTD B (unit head)
• Giorgio Valentini, PO
• Dario Malchiodi, PA
• Marco Mesiti, PA
• Paolo Perlasca, RU

PRIN17 - Task 2 - Deliverable 2.1

[T2 - D2.1] Compress ML models - Development of compressed ML models, by investigating the trade-off between compressed space, compressed time and the prediction accuracy

The aim of our unit is to evaluate the capability of different compression techniques for Neural Network (NN) models to reduce the model space requirements while possibly preserving (or even improving) its accuracy
• Pruning
• Weight sharing
• Sparsified NN training
• Knowledge distillation
• ...
• Combined approaches

Results might change with the input problem

NN models for different categories of problems: multi-class/binary classification, regression, dimensionality reduction and so on


(Deep) Neural Network

Preliminary definitions

INPUT: $n$ input examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, with $x_i \in \mathbb{R}^m$ and $y_i \in Y \subset \mathbb{R}^d$ the label of $x_i$, $i \in \{1, 2, \ldots, n\}$

$m, d > 0$ integers: number of input/output neurons (input/output dimension)

Parameters: $\Theta = (W, b, \theta)$
– $W$: network connections (feed-forward, convolutional, etc.)
– $b$: biases of hidden and output neurons
– $\theta$: other possible model parameters

Activation function: $g_\theta$ transforms the input $I_k^{(h)}$ of neuron $k$ at layer $h$ into the corresponding output $o_k^{(h)} = g_\theta(I_k^{(h)})$

Network function: $\varphi_\Theta : \mathbb{R}^m \to Y$, associating input $x$ with its prediction $\varphi_\Theta(x)$, which estimates the true label function $f : \mathbb{R}^m \to Y$ such that $f(x_i) = y_i$, $i \in \{1, 2, \ldots, n\}$


Network Training - Backpropagation

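Backpropagation

Aim: learning $(W, b)$ to minimize the loss (error) function $L_\Theta$. E.g., the quadratic loss function

$$L_\Theta(x_1, \ldots, x_n, y_1, \ldots, y_n) := \sum_{i=1}^{n} \sum_{k=1}^{d} \left(\varphi_\Theta(x_i)_k - y_{ik}\right)^2$$

Algorithm: used to find a local minimum of $L_\Theta$, $\Theta^* = \arg\min_\Theta L_\Theta$

1. Feed-forward computation: compute $o_i^{(h)} = g_\theta(o_i^{(h-1)} W^{(h)})$ for $1 \le h \le H$ (input $x_i$, $o_i^{(0)} = x_i$, $b$ embedded in $W$), thus

$$L_\Theta = \sum_{i=1}^{n} \sum_{k=1}^{d} \left(o_{ik}^{(H)} - y_{ik}\right)^2$$

2. Back-propagate the error:

$$\frac{\partial L_\Theta}{\partial W_{jr}^{(H)}} = \sum_{i=1}^{n} \frac{\partial L_\Theta}{\partial o_{ir}^{(H)}} \frac{\partial o_{ir}^{(H)}}{\partial W_{jr}^{(H)}} = \sum_{i=1}^{n} 2\left(o_{ir}^{(H)} - y_{ir}\right) \frac{\partial o_{ir}^{(H)}}{\partial W_{jr}^{(H)}} := \sum_{i=1}^{n} \delta_{ir}^{(H)} o_{ij}^{(H-1)}$$

3. Update parameters: $\Delta W_{jr}^{(l)} = -\eta\, o_{ij}^{(l-1)} \delta_{ir}^{(l)}$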
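The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code used for the experiments: it assumes sigmoid activations in every layer, biases embedded in each W through a constant input column, and a full-batch update (the per-example update of step 3 summed over i). The names `forward` and `train_step` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(o):
    # Embed the bias in W by appending a constant-1 column to the layer input.
    return np.hstack([o, np.ones((o.shape[0], 1))])

def forward(X, weights):
    # Step 1: o^(0) = x, o^(h) = g(o^(h-1) W^(h)) for h = 1..H.
    outputs = [X]
    for W in weights:
        outputs.append(sigmoid(add_bias(outputs[-1]) @ W))
    return outputs

def quadratic_loss(O, Y):
    return np.sum((O - Y) ** 2)

def train_step(X, Y, weights, eta=0.5):
    outputs = forward(X, weights)
    # Step 2: delta^(H) = 2 (o^(H) - y) g'(I^(H)); for the sigmoid, g' = o (1 - o).
    delta = 2.0 * (outputs[-1] - Y) * outputs[-1] * (1.0 - outputs[-1])
    grads = []
    for h in range(len(weights) - 1, -1, -1):
        layer_input = add_bias(outputs[h])
        grads.append(layer_input.T @ delta)  # sum_i o_ij^(h) delta_ir^(h+1)
        if h > 0:
            # Propagate the error to the previous layer (bias row of W dropped).
            back = delta @ weights[h][:-1].T
            delta = back * outputs[h] * (1.0 - outputs[h])
    grads.reverse()
    # Step 3: Delta W = -eta * gradient.
    return [W - eta * g for W, g in zip(weights, grads)]
```

Calling `train_step` repeatedly on a list of weight matrices of shapes (m + 1, h_1), (h_1 + 1, h_2), ..., performs plain gradient descent on the quadratic loss.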


Network Pruning

Compressing the network

Once $W^*$ has been computed, this technique simply sets an integer threshold $\tau \in [0, 100)$ and then sets to 0 all the weights lower than the $\tau$-th percentile of the weight distribution

Retrain the network while clamping the pruned weights to 0, obtaining W

The compression rate $\gamma_P$ depends on how W is stored

Another possibility: sparsifying during training...
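A minimal sketch of percentile pruning with clamped retraining. It assumes the percentile is taken over the absolute values of the weights (the slide does not say whether magnitudes or signed values are used); `prune_by_percentile` and `masked_update` are illustrative names.

```python
import numpy as np

def prune_by_percentile(W, tau):
    # Assumption: weights are ranked by absolute value.
    threshold = np.percentile(np.abs(W), tau)
    mask = np.abs(W) >= threshold
    return W * mask, mask

def masked_update(W, grad, mask, eta=0.1):
    # Retraining step: pruned weights stay clamped at 0.
    return (W - eta * grad) * mask
```

Multiplying by the mask after every update is the simplest way to keep pruned connections at zero during retraining; the sparsity pattern is fixed once and never revisited.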


Weight Sharing

NN compression technique [Han et al. 2016] which performs a k-clustering of the connections $W^*$, thus leaving only $k$ distinct weights $c_1, c_2, \ldots, c_k$, corresponding to the centroids of the $k$ computed clusters $C_1, \ldots, C_k$

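Retrain the network by adopting a backpropagation that accumulates gradients across clusters:

$$\frac{\partial L_\Theta}{\partial c_k} = \sum_{j,r} \frac{\partial L_\Theta}{\partial W_{jr}} \frac{\partial W_{jr}}{\partial c_k} = \sum_{j,r} \frac{\partial L_\Theta}{\partial W_{jr}} \mathbb{1}(S_{jr} = k)$$

where $S_{jr} = q$ if $W_{jr}$ belongs to cluster $C_q$, $q \in \{1, \ldots, k\}$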
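A minimal sketch of weight sharing and of the cluster-wise gradient above. It assumes a plain one-dimensional Lloyd's k-means with linearly spaced initial centroids (the slides do not specify the clustering procedure) and uses 0-based cluster indices in S, whereas the slide counts from 1. All function names are illustrative.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    # Assumption: centroids start linearly spaced between min and max weight.
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for q in range(k):
            if np.any(labels == q):
                centroids[q] = values[labels == q].mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, labels

def share_weights(W, k):
    # Returns the k centroids and the index matrix S (same shape as W).
    centroids, labels = kmeans_1d(W.ravel(), k)
    return centroids, labels.reshape(W.shape)

def centroid_gradient(grad_W, S, k):
    # dL/dc_q = sum over (j, r) of dL/dW_jr * 1(S_jr = q)
    return np.bincount(S.ravel(), weights=grad_W.ravel(), minlength=k)
```

The shared weight matrix is recovered as `centroids[S]`, so during retraining only the k centroid values are updated, each with the summed gradient of the weights assigned to it.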

If $k < 256$, it requires 1 byte for each entry of $S$, and 1 float (4 or 8 bytes) for each $c_k$

Limitation: we cannot use exactly $\lceil \log_2 k \rceil$ bits for representing the entries of $S$, but only 1, 2, 4 or 8 bytes
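The storage cost described above can be estimated with a short sketch. It compares the byte-aligned index matrix with the ideal bit-packed one, assuming 0-based indices (so that up to 256 clusters fit in one byte) and 4-byte floats by default. The 784 × 200 layer in the usage note is only an illustrative size.

```python
import numpy as np

def index_dtype(k):
    # Smallest standard unsigned integer type holding the indices 0..k-1.
    for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
        if k - 1 <= np.iinfo(dt).max:
            return dt
    raise ValueError("k is too large")

def shared_storage_bytes(shape, k, float_bytes=4):
    # Byte-aligned index matrix S plus the k centroids.
    n = int(np.prod(shape))
    return n * np.dtype(index_dtype(k)).itemsize + k * float_bytes

def ideal_storage_bytes(shape, k, float_bytes=4):
    # Bit-packed S using ceil(log2 k) bits per entry, plus the k centroids.
    n = int(np.prod(shape))
    bits = (k - 1).bit_length()  # equals ceil(log2 k) for k >= 1
    return (n * bits + 7) // 8 + k * float_bytes
```

For a 784 × 200 layer with k = 50, the dense float32 matrix takes 627200 bytes, the byte-aligned shared representation 157000 bytes, and the ideal 6-bit packing 117800 bytes.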


Pruning + Weight Sharing

A combined compression strategy can be adopted: Training + Pruning + WS + Re-training

Learning issues: the way the sparse matrix is stored might conflict with the way the centroids and the index matrix $S$ are stored

Computing the network function $\varphi_\Theta$ can be slowed down

Ongoing study (collaboration with UNIPO): using a representation suitable for sparse matrices containing repeated values

- The aim is to get a compression function $\psi(\tau, k)$ depending on the number of clusters used and on the pruning threshold

- Then, given the desired compression ratio $\rho$, compute the optimal NN model having such a compression ratio, that is

$$(\tau^*, k^*) = \arg\max_{\tau, k \,:\, \rho \le \psi(\tau, k)} A(\tau, k)$$

where $A(\tau, k)$ is the testing accuracy of the NN model compressed using $\tau$ as pruning percentile and $k$ as number of WS clusters
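The constrained selection above can be sketched as a search over already evaluated configurations. This assumes that ψ(τ, k) and A(τ, k) have been measured for a finite grid of (τ, k) pairs; `select_model` is an illustrative name, and the numbers in the usage note are invented placeholders, not experimental results.

```python
def select_model(candidates, rho):
    """Pick the most accurate configuration whose compression meets rho.

    candidates: iterable of (tau, k, psi, accuracy) tuples, where psi is the
    measured compression ratio psi(tau, k) and accuracy is A(tau, k).
    Returns the best feasible tuple, or None if no configuration has psi >= rho.
    """
    feasible = [c for c in candidates if c[2] >= rho]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[3])
```

With the placeholder grid `[(25, 100, 3.0, 0.98), (50, 50, 6.0, 0.97), (75, 10, 12.0, 0.95)]` and ρ = 5, the first configuration is discarded for insufficient compression and the second one is returned.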


T2 - D2.1. Experiment 1 - MNIST Digits Classification

Problem: classification of handwritten digit images from MNIST

Input: 784-dimensional vector obtained by vectorizing the 28 × 28 matrix representing the input grayscale image. Each cell is a value in [0, 255] representing the black level: 0 = white, 255 = black. 60000 images for training and 10000 images for testing

– Classes are almost balanced [from 5421 (digit 5) to 6742 (digit 1) training examples]

– Targets are 10-dimensional vectors t, with ti = 1 if the image is digit i , 0 otherwise

Output: real-valued vector $y$ of size 10 representing the prediction for each of the digits (softmax)

Model: a feed-forward neural network with a variable number of hidden layers and neurons, 784 input neurons and 10 output neurons.

– Activation functions: sigmoid or ReLU, for hidden and output layers
– Learning rate and batch size have been tuned

Other configurations: $\tau$, pruning percentile; $k_i$, number of distinct weights at layer $i$ for the weight sharing technique
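A minimal sketch of the input/target encoding described above: each 28 × 28 image is flattened into a 28 · 28 = 784-dimensional vector and each digit label becomes a 10-dimensional indicator vector. Rescaling pixel values to [0, 1] is an assumption, not something stated in the slides.

```python
import numpy as np

def vectorize(images):
    # (n, 28, 28) uint8 images -> (n, 784) floats.
    # Assumption: pixel values are rescaled from [0, 255] to [0, 1].
    return images.reshape(len(images), -1).astype(np.float64) / 255.0

def one_hot(labels, n_classes=10):
    # t_i = 1 if the image is digit i, 0 otherwise.
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # stabilise the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)
```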


T2 - D2.1. Experiment 1, results

Table: Just pruning. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

                      Base model                Compressed model
H.Neurons   τ    k    Acc     Train time (s)    Acc     Train time
[200]       10   -    0.982   919               0.982   32
[200]       20   -                              0.982   59
[200]       25   -                              0.982   115
[200]       30   -                              0.982   144
[200]       40   -                              0.982   349
[200]       50   -                              0.983   789
[200]       60   -                              0.982   1675
[200]       70   -                              0.981   2772
[200]       75   -                              0.982   3139
[200]       80   -                              0.981   2963
[200]       90   -                              0.978   2633


T2 - D2.1. Experiment 1, results 2

Table: Weight Sharing. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

                             Base model                Compressed model
H.Neurons   τ    k           Acc     Train time (s)    Acc     Train time
[200]       -    [200 50]    0.982   919               0.982   191
[200]       -    [150 50]                              0.982   223
[200]       -    [100 50]                              0.981   247
[200]       -    [50 50]                               0.981   351
[200]       -    [50 25]                               0.982   264
[200]       -    [25 50]                               0.982   284
[200]       -    [25 25]                               0.980   1498
[200]       -    [25 10]                               0.981   710
[200]       -    [10 25]                               0.978   133
[200]       -    [10 10]                               0.979   109


T2 - D2.1. Experiment 1, results 3

Table: Pruning + WS. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

                              Base model                Compressed model
H.Neurons   τ    k            Acc     Train time (s)    Acc     Train time
[200]       25   [100 50]     0.982   919               0.981   442
[200]       50   [100 50]                               0.981   4760
[200]       75   [100 50]                               0.974   4746

[200]       50   [50 50]                                0.980   6177
[200]       50   [150 50]                               0.981   4768
[200]       50   [200 50]                               0.981   4115
[200]       50   [100 100]                              0.980   4919
[200]       50   [150 100]                              0.980   4738
[200]       50   [100 150]                              0.980   4807
[200]       50   [25 25]                                0.980   3965
[200]       50   [50 25]                                0.981   5992
[200]       50   [25 50]                                0.980   5757
[200]       50   [10 10]                                0.981   1776
[200]       50   [25 10]                                0.981   5715
[200]       50   [10 25]                                0.979   2246

[200]       75   [10 10]                                0.965   455
[200]       75   [25 10]                                0.971   1430
[200]       75   [50 10]                                0.972   3210
[200]       75   [500 50]                               0.976   5713
[200]       75   [750 50]                               0.976   5704
[200]       75   [750 100]                              0.977   5671


PRIN17 - UNIMI Short- Mid-term objectives

[T2 - D2.1]. Jointly with UNIPO, extensively testing the succinct representation for the sparse matrices obtained after the application of pruning + WS, with a particular focus on query efficiency as well (on multiple data sets and learning problems)

Studying appropriate optimization procedures for selecting the model among those guaranteeing the required compression

Investigating other approaches to sparsify the weight matrix, e.g. adopting loss regularization terms which sparsify connections during model training

Surveying the literature to find other promising compression techniques, possibly for a large-scale comparison


PRIN17 - UNIMI Short- Mid-term objectives

We can also contribute to

[T1 - D1.2] A collection of known and new implementations of ML-based and compressed data structures, implementing the procedures just described. Such implementations would be used in the next tasks T3 and T4 as “building blocks” for the multicriteria framework

[T2 - D2.2] A collection of software prototypes for succinct ML models

Other possible collaborations are welcome



[T1 - D1.1] Compare classic Data Structures vs Purely Learned Indexes - Characterization of the space-time-accuracy performance of ML models in terms of the distribution of the input data, and their comparison against the known (compressed) data structures

Collaboration with UNIPA unit

Preliminary experiments about learned indexes with neural network (NN) models [UNIPA] + NN compression techniques [UNIMI] to reduce the space occupancy + their comparison with some state-of-the-art indexes [UNIPA+UNIMI]

PROBLEM INPUT: a sorted set of integers X [extendable to other types of elements]
PROBLEM OUTPUT: a learned index for searching elements x ∈ X (given x, return pos(x), its position in X)
STEP 1: Coding the input for feeding the NN [e.g. binary representation]
STEP 2: Estimate the empirical cumulative distribution F of the data with respect to X, and consider as training examples the pairs (x ∈ X, F(x))
STEP 3: NN training; φ will be an estimate of F
STEP 4 [optional]: NN compression
STEP 5: Index validation
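Steps 1 and 2 can be sketched as follows. The sketch assumes distinct keys, 1-based positions (so that F(x) · |X| = pos(x)), and a most-significant-bit-first encoding. It uses the bit length of the largest key as input width, which is an assumption: it coincides with ⌈log2 q⌉ except when q is an exact power of two, where one extra bit is needed. Function names are illustrative.

```python
import numpy as np

def binary_encode(x, n_bits):
    # STEP 1: one row of n_bits binary features per key, most significant bit first.
    x = np.asarray(x, dtype=np.int64)
    shifts = np.arange(n_bits - 1, -1, -1)
    return ((x[:, None] >> shifts) & 1).astype(np.float32)

def training_pairs(X_sorted):
    # STEP 2: pair each key with its empirical cumulative distribution value.
    X = np.asarray(X_sorted, dtype=np.int64)
    n_bits = max(1, int(X.max()).bit_length())
    inputs = binary_encode(X, n_bits)
    targets = np.arange(1, len(X) + 1) / len(X)  # F(x) = pos(x) / |X|
    return inputs, targets
```

The resulting `inputs` and `targets` are what a network with a single sigmoid output unit would be trained on in step 3.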



Adopted models

Feed-forward neural net with 0 (NN0), 1 (NN1) or 2 (NN2) hidden layers

$\lceil \log_2 q \rceil$ input units ($q = \max_{x \in X} x$) and 1 output unit, and a variable number of hidden units (when present)

Sigmoid activation for the output unit, sigmoid or ReLU for hidden units. Quadratic loss function.

Data: three uniform distributions of integers of size $2^9$ (Data1), $2^{13}$ (Data2), and $2^{20}$ (Data3)

Inference:
– compute $\varphi(x)$ for each $x \in X$
– compute the error $e(x) := \left|\, \varphi(x)\,|X| - \mathrm{pos}(x) \,\right|$, where $\mathrm{pos}(x)$ is the position of $x$ in the input sorted sequence
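A minimal sketch of how the error e(x) is computed and how the maximum error can bound the search window at query time. It assumes 1-based positions and a `predict` callable returning an estimate of F(x) in [0, 1] for a scalar or an array. The bounded binary search is a common way of using a learned index and is an assumption here, not a procedure stated in the slides.

```python
import bisect
import math
import numpy as np

def max_error(predict, X_sorted):
    # e(x) = | predict(x) * |X| - pos(x) |, maximised over all keys.
    n = len(X_sorted)
    pos = np.arange(1, n + 1)
    est = predict(np.asarray(X_sorted)) * n
    return float(np.max(np.abs(est - pos)))

def lookup(x, X_sorted, predict, e_max):
    # Search only the window of radius ceil(e_max) around the predicted position.
    n = len(X_sorted)
    guess = int(round(float(predict(x)) * n))
    radius = math.ceil(e_max)
    lo = min(max(0, guess - 1 - radius), n)
    hi = min(max(lo, guess + radius), n)
    i = bisect.bisect_left(X_sorted, x, lo, hi)
    if i < n and X_sorted[i] == x:
        return i + 1  # 1-based position
    return None
```

The smaller the maximum error, the narrower the window that has to be searched, which is why the experiments report max e rather than only the mean error.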


PRIN17 - T1 - D1.1, preliminary results

Table: Hidden layers have 256 units. In parentheses, the best result with compression.

Model   Training time (s)   max e         e_max/|X|     e_mean/|X|

Data 1 - size 5.12 · 10^2
NN2     37                  4 (2)         0.78          1.83 · 10^-3 (6.28 · 10^-4)
NN1     25                  3 (2)         0.78          1.60 · 10^-3 (5.43 · 10^-4)
NN0     0.59                9 (8)         1.7 · 10^-2   4.74 · 10^-3 (4.58 · 10^-3)

Data 2 - size 8.192 · 10^3
NN2     42.3                35 (32)       0.48          1.14 · 10^-3 (8.18 · 10^-4)
NN1     35.8                33 (33)       0.4           1.12 · 10^-3 (5.82 · 10^-4)
NN0     13.7                53 (43)       0.64          1.62 · 10^-3 (1.52 · 10^-3)

Data 3 - size 1.048 · 10^6
NN2     470                 1270 (660)    0.12          2.4 · 10^-4 (1.45 · 10^-4)
NN1     438                 1031 (292)    9.0 · 10^-2   2.33 · 10^-4 (5.11 · 10^-5)
NN0     267                 905 (706)     8.6 · 10^-2   1.85 · 10^-4 (1.82 · 10^-4)


PRIN17 - T1 - D1.1, preliminary results 2

Extending the learned index in the sense of Kraska et al. 2017: divide the input sequence into s splits and learn an NN0 model for each

Model   s   max e (one value per split)

Data 1 - size 5.12 · 10^2
NN2     1   4
NN1     1   3
NN0     1   9
NN0     2   6, 7
NN0     4   3, 5, 8, 7
NN0     8   3, 4, 6, 3, 3, 4, 3, 10

Data 2 - size 8.192 · 10^3
NN2     1   35
NN1     1   33
NN0     1   53
NN0     2   38, 22
NN0     4   25, 43, 156, 24
NN0     8   17, 25, 28, 27, 15, 20, 20, 19

Data 3 - size 1.048 · 10^6
NN2     1   1270
NN1     1   1031
NN0     1   905
NN0     2   615, 743
NN0     4   550, 625, 757, 516
NN0     8   568, 328, 299, 327, 229, 319, 296, 386

Best results (maximum value tested: s = 128, if possible):
– Data 1, s = 20, e_max = 1
– Data 2, s = 72, e_max = 9
– Data 3, s = 112, e_max = 99

Note: the larger s, the more space is occupied
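The splitting scheme can be sketched as follows. To keep the example short, each split is fitted with a least-squares line on the raw keys instead of an NN0 model; this is a deliberate substitution for illustration, not the model used in the experiments. Positions are 1-based and global, and each split is assumed to contain at least two distinct keys.

```python
import numpy as np

def fit_split_models(X_sorted, s):
    # One model per split; each entry is (first key, slope, intercept, max error).
    X = np.asarray(X_sorted, dtype=float)
    pos = np.arange(1, len(X) + 1, dtype=float)
    models = []
    for Xs, ps in zip(np.array_split(X, s), np.array_split(pos, s)):
        # Stand-in for NN0: a straight line fitted by least squares.
        slope, intercept = np.polyfit(Xs, ps, deg=1)
        err = float(np.max(np.abs(slope * Xs + intercept - ps)))
        models.append((float(Xs[0]), float(slope), float(intercept), err))
    return models
```

Each additional split adds one more set of model parameters to store, which is the space cost noted above, while the per-split maximum error typically shrinks because each model only has to approximate a shorter piece of the distribution.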


PRIN17 - Task 1 - D1.1: Ongoing research

The problem is that, in this setting, we aim at minimizing the maximum error, but the quadratic loss function is not a continuous approximation of the maximum function. When $e_i$ is the error over sample $x_i$ ($d = 1$), each weight $W_{jr}$ is updated by an amount proportional to

$$\frac{\partial L_\Theta}{\partial W_{jr}^{(H)}} = \sum_{i=1}^{n} \frac{\partial L_\Theta}{\partial o_{ir}^{(H)}} \frac{\partial o_{ir}^{(H)}}{\partial W_{jr}^{(H)}} = \sum_{i=1}^{n} 2 e_i \frac{\partial o_{ir}^{(H)}}{\partial W_{jr}^{(H)}}$$

In principle, we need a loss function emphasizing the larger errors, while neglecting the smaller ones

There are some possible solutions for a loss function approximating the maximum function $L_{\max}(e_1, \ldots, e_n) = \max_{i \in \{1, \ldots, n\}} e_i$

We are now running some experiments using loss functions approximating $L_{\max}$
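The slides do not say which approximations are being tested. As an illustration, two standard smooth surrogates of the maximum are the log-sum-exp function and the p-norm; both are sketched below for non-negative errors, with `alpha` and `p` as hypothetical sharpness parameters.

```python
import numpy as np

def logsumexp_loss(e, alpha=10.0):
    # (1 / alpha) * log(sum_i exp(alpha * e_i)), computed in a stable way.
    # Satisfies max(e) <= loss <= max(e) + log(n) / alpha.
    m = np.max(e)
    return m + np.log(np.sum(np.exp(alpha * (e - m)))) / alpha

def logsumexp_weights(e, alpha=10.0):
    # Gradient of the loss with respect to e: softmax(alpha * e).
    # Larger errors receive larger weights, smaller ones are nearly ignored.
    z = np.exp(alpha * (e - np.max(e)))
    return z / z.sum()

def p_norm_loss(e, p=8):
    # (sum_i |e_i|^p)^(1/p); satisfies max|e| <= loss <= n^(1/p) * max|e|.
    return np.sum(np.abs(e) ** p) ** (1.0 / p)
```

In both cases, increasing the sharpness parameter moves the surrogate closer to the true maximum, at the price of gradients that concentrate on fewer and fewer samples.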


PRIN17 - Task 1 - D1.1: Short- Mid-term objectives

[T1 - D1.1]. Short- Mid-term objectives

Once this investigation is complete:

Currently we can relate the expected space occupancy to the compression rate, but we aim (if possible) at relating, at least empirically, the compression rate to the maximum error

Characterize the effectiveness of NN compression techniques with different data distributions

Compare with other ‘classical’ indexes (B-trees, B+ trees, etc.) (UNIPA is already working on that)

