Neural Networks Compression Techniques for Succinct Classification Models
PRIN 17 Kick off meeting - 06.02.20
Dipartimento Informatica - Unità UNIMI
Unit members
Marco Frasca, RTD B (unit head)
Giorgio Valentini, PO
Dario Malchiodi, PA
Marco Mesiti, PA
Paolo Perlasca, RU
PRIN17 - Task 2 - Deliverable 2.1
[T2 - D2.1] Compress ML models - Development of compressed ML models, by investigating the trade-off between compressed space, compressed time and the prediction accuracy
The aim of our unit is to evaluate the capability of different compression techniques for Neural Network (NN) models to reduce the model space requirements while possibly preserving (or even improving) its accuracy
• Pruning
• Weight sharing
• Sparsified NN training
• Knowledge distillation
• ...
• Combined approaches
Results might change with the input problem
NN models for different categories of problems: multi-class/binary classification, regression, dimensionality reduction and so on
Dipartimento Informatica - Unità UNIMI UNIMI - PRIN17 Kick Off Meeting 2/19
(Deep) Neural Network
Preliminary definitions
INPUT: n input examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, with $x_i \in \mathbb{R}^m$ and $y_i \in Y \subset \mathbb{R}^d$ the label of $x_i$, $i \in \{1, 2, \ldots, n\}$
m, d > 0 integers: number of input/output neurons (input/output dimension)
Parameters: $\Theta = (W, b, \theta)$
– W: network connections (feed-forward, convolutional, etc.)
– b: biases of hidden and output neurons
– θ: other possible model parameters
Activation function: $g_\theta$ transforms the input $I^{(h)}_k$ of neuron k at layer h into the corresponding output $o^{(h)}_k = g_\theta(I^{(h)}_k)$
Network function: $\varphi_\Theta : \mathbb{R}^m \to Y$, associating input x with its prediction $\varphi_\Theta(x)$, estimating the true label function $f : \mathbb{R}^m \to Y$ such that $f(x_i) = y_i$, $i \in \{1, 2, \ldots, n\}$
Network Training - Backpropagation
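Backpropagation
Aim: learning (W, b) to minimize the loss (error) function $L_\Theta$
E.g., quadratic loss function
\[ L_\Theta(x_1, \ldots, x_n, y_1, \ldots, y_n) := \sum_{i=1}^{n} \sum_{k=1}^{d} \left( \varphi_\Theta(x_i)_k - y_{ik} \right)^2 \]
Algorithm: used to find a local minimum of $L_\Theta$, $\Theta^* = \arg\min_{\Theta} L_\Theta$
1. Feed-forward computation: compute $o^{(h)}_i = g_\theta(o^{(h-1)}_i W^{(h)})$ for $1 \le h \le H$ (input $x_i$, $o^{(0)}_i = x_i$; b embedded in W), thus
\[ L_\Theta = \sum_{i=1}^{n} \sum_{k=1}^{d} \left( o^{(H)}_{ik} - y_{ik} \right)^2 \]
2. Back-propagate the error:
\[ \frac{\partial L_\Theta}{\partial W^{(H)}_{jr}} = \sum_{i=1}^{n} \frac{\partial L_\Theta}{\partial o^{(H)}_{ir}} \frac{\partial o^{(H)}_{ir}}{\partial W^{(H)}_{jr}} = \sum_{i=1}^{n} 2 \left( o^{(H)}_{ir} - y_{ir} \right) \frac{\partial o^{(H)}_{ir}}{\partial W^{(H)}_{jr}} := \sum_{i=1}^{n} \delta^{(H)}_{ir} \, o^{(H-1)}_{ij} \]
3. Update parameters: $\Delta W^{(l)}_{jr} = -\eta \, o^{(l-1)}_{ij} \, \delta^{(l)}_{ir}$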
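The three steps above can be sketched for the simplest case, a network with no hidden layer (H = 1) and sigmoid outputs. This is only an illustrative sketch under those assumptions, not the unit's implementation: the function names, the toy data and the learning rate are all hypothetical, and the update is accumulated over the whole batch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(X, W):
    """Step 1: o_i = g(x_i W) for every example (bias embedded in W)."""
    return [[sigmoid(sum(x[j] * W[j][r] for j in range(len(W))))
             for r in range(len(W[0]))] for x in X]

def loss(O, Y):
    """Quadratic loss summed over examples and output units."""
    return sum((o - y) ** 2 for orow, yrow in zip(O, Y)
               for o, y in zip(orow, yrow))

def gradient(X, Y, W):
    """Step 2: dL/dW_jr = sum_i delta_ir * x_ij."""
    O = forward(X, W)
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, orow, yrow in zip(X, O, Y):
        for r, (o, y) in enumerate(zip(orow, yrow)):
            # delta_ir = 2 (o_ir - y_ir) * g'(I_ir), with g' = o (1 - o)
            delta = 2.0 * (o - y) * o * (1.0 - o)
            for j in range(len(W)):
                grad[j][r] += delta * x[j]
    return grad

def train(X, Y, W, eta=0.2, epochs=200):
    """Step 3: W <- W - eta * dL/dW, repeated for a fixed number of epochs."""
    for _ in range(epochs):
        g = gradient(X, Y, W)
        W = [[w - eta * gw for w, gw in zip(wrow, grow)]
             for wrow, grow in zip(W, g)]
    return W

# Toy problem (logical OR); the last input column is a constant 1 for the bias.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
Y = [[0], [1], [1], [1]]
W0 = [[0.1], [-0.2], [0.05]]
W1 = train(X, Y, W0)
```

With hidden layers, the same update applies to each layer once the deltas have been propagated backwards.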
Network Pruning
Compressing the network
Once $W^*$ has been computed, this technique simply sets an integer threshold $\tau \in [0, 100)$ and then sets to 0 all the weights lower than the $\tau$-th percentile of the weight distribution

Retrain the network while keeping the pruned weights clamped to 0, obtaining W

The compression rate $\gamma_P$ depends on how W is stored

Another possibility: sparsifying during training...
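A minimal sketch of percentile pruning with clamped retraining follows. It assumes that the percentile is taken over weight magnitudes (the slide does not say whether signed values or magnitudes are used), and the helper names are hypothetical.

```python
def percentile(values, tau):
    """tau-th percentile, linearly interpolated between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * tau / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def prune(weights, tau):
    """Zero every weight whose magnitude is below the tau-th percentile.

    Returns the pruned matrix and a 0/1 mask marking the surviving weights.
    """
    threshold = percentile([abs(w) for row in weights for w in row], tau)
    mask = [[0 if abs(w) < threshold else 1 for w in row] for row in weights]
    pruned = [[w * m for w, m in zip(wrow, mrow)]
              for wrow, mrow in zip(weights, mask)]
    return pruned, mask

def masked_update(weights, grads, mask, eta):
    """Gradient step that keeps pruned weights clamped to 0."""
    return [[w - eta * g * m for w, g, m in zip(wrow, grow, mrow)]
            for wrow, grow, mrow in zip(weights, grads, mask)]

W = [[0.1, -0.5], [0.9, -0.05]]
pruned, mask = prune(W, 50)
updated = masked_update(pruned, [[1.0, 1.0], [1.0, 1.0]], mask, eta=0.1)
```

Multiplying the gradient by the mask is one simple way to realise the clamping during retraining: pruned connections receive no update and therefore stay at 0.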
Weight Sharing
NN compression technique [Han et al. 2016] which performs a k-clustering of the connections $W^*$, thus having only k distinct weights $c_1, c_2, \ldots, c_k$, corresponding to the centroids of the k computed clusters $C_1, \ldots, C_k$

Retrain the network by adopting a backpropagation cumulative across clusters:
\[ \frac{\partial L_\Theta}{\partial c_q} = \sum_{j,r} \frac{\partial L_\Theta}{\partial W_{jr}} \frac{\partial W_{jr}}{\partial c_q} = \sum_{j,r} \frac{\partial L_\Theta}{\partial W_{jr}} \, \mathbb{1}(S_{jr} = q) \]
where $S_{jr} = q$ if $W_{jr}$ belongs to cluster $C_q$, $q \in \{1, \ldots, k\}$

If $k \le 256$, it requires 1 byte for each entry of S, and 1 float (4 or 8 bytes) for each $c_q$

Limitation: we cannot use exactly $\lceil \log_2 k \rceil$ bits for representing the entries of S, but 1, 2, 4 or 8 bytes
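The clustering, the cumulative gradient and the storage cost of the index matrix can be sketched as follows. This is a hedged illustration on a flattened weight list: the linear centroid initialisation and the plain Lloyd iterations are assumptions, and all function names are hypothetical.

```python
def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda q: abs(v - centroids[q]))

def kmeans_1d(values, k, iters=20):
    """Cluster scalar weights into k groups; returns (centroids, assignment)."""
    lo, hi = min(values), max(values)
    # Assumption: centroids start evenly spaced over the weight range.
    centroids = [lo + (hi - lo) * (q + 0.5) / k for q in range(k)]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in values]
        for q in range(k):
            members = [v for v, a in zip(values, assign) if a == q]
            if members:  # an empty cluster keeps its previous centroid
                centroids[q] = sum(members) / len(members)
    return centroids, [nearest(v, centroids) for v in values]

def cluster_gradients(grads, assign, k):
    """Gradient of each centroid: sum of the gradients of its member weights."""
    out = [0.0] * k
    for g, q in zip(grads, assign):
        out[q] += g
    return out

def index_entry_bytes(k):
    """Bytes per entry of S when only 1, 2, 4 or 8 byte integers are usable."""
    bits = max(1, (k - 1).bit_length())
    for size in (1, 2, 4, 8):
        if bits <= 8 * size:
            return size
    raise ValueError("k too large for an 8-byte index")

weights = [-1.0, -0.9, 1.0, 1.1]
centroids, S = kmeans_1d(weights, k=2)
shared = [centroids[q] for q in S]            # weights after sharing
grad_c = cluster_gradients([1.0, 2.0, 3.0, 4.0], S, k=2)
```

During retraining only the k centroids are updated, so every weight in a cluster moves together and the number of distinct values stays at k.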
Pruning + Weight Sharing
A combined compression strategy can be adopted: Training + Pruning + WS + Re-training

Learning issues: the way of storing the sparse matrix might conflict with the way the centroids and the index matrix S are stored
Computing the network function $\varphi_\Theta$ can be slowed down

Ongoing study (collaboration with UNIPO): utilizing a representation suitable for sparse matrices containing repeated values

- The aim is to get a compression function $\psi(\tau, k)$ depending on the number of clusters used and on the pruning threshold

- Then, given the desired compression ratio $\rho$, compute the optimal NN model having such compression ratio, that is
\[ (\tau^*, k^*) = \arg\max_{\tau, k \,:\, \rho \le \psi(\tau, k)} A(\tau, k), \]
where $A(\tau, k)$ is the testing accuracy of the NN model compressed using $\tau$ as pruning percentile and k as number of WS clusters
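If the candidate values of τ and k form a small grid, the constrained arg max can be found by exhaustive search. The sketch below assumes that ψ and A are available as callables; the two toy formulas used in the usage example are placeholders for illustration, not measured compression ratios or accuracies.

```python
def select_model(taus, ks, psi, accuracy, rho):
    """Return (accuracy, tau, k) maximising accuracy subject to psi(tau, k) >= rho.

    Returns None when no configuration reaches the requested compression ratio.
    """
    best = None
    for tau in taus:
        for k in ks:
            if psi(tau, k) < rho:
                continue  # not compressed enough, skip this configuration
            acc = accuracy(tau, k)
            if best is None or acc > best[0]:
                best = (acc, tau, k)
    return best

# Placeholder functions (hypothetical): compression grows with tau,
# accuracy degrades with stronger pruning and with fewer clusters.
toy_psi = lambda tau, k: 100.0 / (100.0 - tau)
toy_acc = lambda tau, k: 1.0 - tau / 1000.0 - 1.0 / k

best = select_model([25, 50, 75], [10, 50], toy_psi, toy_acc, rho=2.0)
none = select_model([25, 50, 75], [10, 50], toy_psi, toy_acc, rho=10.0)
```

In practice each evaluation of A requires compressing and retraining a model, so the grid has to stay small or the search must be guided by a cheaper estimate.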
T2 - D2.1. Experiment 1 - MNIST Digits Classification
Problem: Prediction of handwritten digit images from MNIST
Input: 784-dimensional vector obtained by vectorizing the 28 × 28 matrix representing the input grayscale image. Each cell is a value in [0, 255] representing the black level: 0 = white, 255 = black. 60000 images for training and 10000 images for testing
– Classes are almost balanced [from 5421 (digit 5) to 6742 (digit 1) training examples]
– Targets are 10-dimensional vectors t, with $t_i = 1$ if the image is digit i, 0 otherwise

Output: real-valued vector y of size 10 representing the prediction for each of the digits (softmax)
Model: a feed-forward neural network with a variable number of hidden layers and neurons, 784 input neurons and 10 output neurons.

– Activation functions: sigmoid or ReLU, for hidden and output layers
– Learning rate and batch size have been tuned

Other configurations: τ, percentile of pruning; $k_i$, number of distinct weights at layer i for the weight sharing technique
T2 - D2.1. Experiment 1, results
Table: Pruning only. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

H.Neurons  τ   k  Base Model             Compr. Model
                  Acc    Train time (s)  Acc    Train time
[200]      10  -  0.982  919             0.982  32
[200]      20  -                         0.982  59
[200]      25  -                         0.982  115
[200]      30  -                         0.982  144
[200]      40  -                         0.982  349
[200]      50  -                         0.983  789
[200]      60  -                         0.982  1675
[200]      70  -                         0.981  2772
[200]      75  -                         0.982  3139
[200]      80  -                         0.981  2963
[200]      90  -                         0.978  2633
T2 - D2.1. Experiment 1, results 2
Table: Weight Sharing. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

H.Neurons  τ  k         Base Model             Compr. Model
                        Acc    Train time (s)  Acc    Train time
[200]      -  [200 50]  0.982  919             0.982  191
[200]      -  [150 50]                         0.982  223
[200]      -  [100 50]                         0.981  247
[200]      -  [50 50]                          0.981  351
[200]      -  [50 25]                          0.982  264
[200]      -  [25 50]                          0.982  284
[200]      -  [25 25]                          0.980  1498
[200]      -  [25 10]                          0.981  710
[200]      -  [10 25]                          0.978  133
[200]      -  [10 10]                          0.979  109
T2 - D2.1. Experiment 1, results 3
Table: Pruning + WS. Acc: test accuracy. τ: pruning percentile. k: number of clusters.

H.Neurons  τ   k          Base Model             Compr. Model
                          Acc    Train time (s)  Acc    Train time
[200]      25  [100 50]   0.982  919             0.981  442
[200]      50  [100 50]                          0.981  4760
[200]      75  [100 50]                          0.974  4746

[200]      50  [50 50]                           0.980  6177
[200]      50  [150 50]                          0.981  4768
[200]      50  [200 50]                          0.981  4115
[200]      50  [100 100]                         0.980  4919
[200]      50  [150 100]                         0.980  4738
[200]      50  [100 150]                         0.980  4807
[200]      50  [25 25]                           0.980  3965
[200]      50  [50 25]                           0.981  5992
[200]      50  [25 50]                           0.980  5757
[200]      50  [10 10]                           0.981  1776
[200]      50  [25 10]                           0.981  5715
[200]      50  [10 25]                           0.979  2246

[200]      75  [10 10]                           0.965  455
[200]      75  [25 10]                           0.971  1430
[200]      75  [50 10]                           0.972  3210
[200]      75  [500 50]                          0.976  5713
[200]      75  [750 50]                          0.976  5704
[200]      75  [750 100]                         0.977  5671
PRIN17 - UNIMI Short- Mid-term objectives
[T2 - D2.1].
Jointly with UNIPO, extensively testing the succinct representation for the sparse matrices obtained after the application of pruning + WS, with a particular focus on the query efficiency as well (on multiple data sets and learning problems)

Studying appropriate optimization procedures for selecting the model, among those guaranteeing the required compression

Investigating other approaches to make the weight matrix sparse, e.g. adopting loss regularization terms which sparsify connections during model training

Scanning the literature to find other promising compression techniques, also for a large-scale comparison
PRIN17 - UNIMI Short- Mid-term objectives
We can also give a contribution to
[T1 - D1.2] A collection of known and new implementations of ML-based and compressed data structures, implementing the procedures just described.
Such implementations would be used in the next tasks T3 and T4 as “building blocks” for the multicriteria framework
[T2 - D2.2] A collection of software prototypes for succinct ML-models
Other possible collaborations are welcome
PRIN17 - Task 1 - Deliverable 1.1
[T1 - D1.1] Compare classic Data Structures vs Purely Learned Indexes - Characterization of the space-time-accuracy performance of ML models in terms of distribution of the input data, and their comparison against the known (compressed) data structures
Collaboration with UNIPA unit
Preliminary experiments about learned indexes with neural network (NN) models [UNIPA] + NN compression techniques [UNIMI] to reduce the space occupancy + their comparison with some state-of-the-art (SOA) indexes [UNIPA+UNIMI]
PROBLEM INPUT: sorted set of integers X [extendable to other types of elements]
PROBLEM OUTPUT: a learned index for searching elements x ∈ X (given x, return pos(x), its position in X)
STEP 1: Coding the input for feeding the NN [e.g. binary representation]
STEP 2: Estimate the empirical cumulative distribution F of data with respect to X, and consider as training examples the pairs (x ∈ X, F(x))
STEP 3: NN training, φ will be an estimate of F
STEP 4 [optional]: NN compression
STEP 5: Index validation
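Steps 1 and 2 can be sketched as below. The sketch assumes distinct non-negative integers, 1-based positions, and the empirical CDF F(x) = pos(x)/|X|; the function names are hypothetical, and the input width is taken from Python's `int.bit_length`, which is an illustrative choice.

```python
def to_bits(x, width):
    """STEP 1: fixed-width binary encoding of x, most significant bit first."""
    return [(x >> b) & 1 for b in range(width - 1, -1, -1)]

def training_pairs(X):
    """STEP 2: pairs (encoded x, F(x)) for a sorted list X of distinct integers.

    F is the empirical cumulative distribution, F(x) = pos(x) / |X|,
    with pos(x) counted from 1.
    """
    n = len(X)
    width = max(1, max(X).bit_length())
    return [(to_bits(x, width), (i + 1) / n) for i, x in enumerate(X)]

pairs = training_pairs([3, 5, 6, 9])
```

Each pair is a training example for step 3: the bit vector is the network input and F(x) is the regression target in (0, 1], which matches a sigmoid output unit.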
PRIN17 - Task 1 - Deliverable 1.1
Adopted models
Feed-forward neural net with 0 (NN0), 1 (NN1) or 2 (NN2) hidden layers

$\lceil \log_2 q \rceil$ input units ($q = \max_{x \in X} x$) and 1 output unit, and a variable number of hidden units (when present)

Sigmoid activation for the output unit, sigmoid or ReLU for hidden units. Quadratic loss function.
Data: three uniform distributions of integers of size $2^9$ (Data1), $2^{13}$ (Data2), and $2^{20}$ (Data3)

Inference:
– compute $\varphi(x)$ for each $x \in X$
– compute the error $e(x) := \left| \varphi(x) \, |X| - \mathrm{pos}(x) \right|$, where pos(x) is the position of x in the input sorted sequence
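The error defined above also bounds the search window needed at query time. The sketch below assumes 1-based positions and uses a linear function as a stand-in for the trained network; `index_errors` and `lookup` are hypothetical names, and the bounded binary search is the usual way a learned index is queried rather than something stated on the slide.

```python
import bisect
import math

def index_errors(X, phi):
    """e(x) = |phi(x) * |X| - pos(x)| for every x in the sorted list X."""
    n = len(X)
    return [abs(phi(x) * n - (i + 1)) for i, x in enumerate(X)]

def lookup(X, phi, x, e_max):
    """Return the 1-based position of x, or None if x is not in X.

    Only the window of positions within e_max of the prediction is searched.
    """
    n = len(X)
    guess = phi(x) * n
    lo = max(0, math.floor(guess - e_max) - 1)   # 0-based, inclusive
    hi = min(n, math.ceil(guess + e_max))        # 0-based, exclusive
    j = bisect.bisect_left(X, x, lo, hi)
    return j + 1 if j < hi and X[j] == x else None

X = list(range(0, 100, 5))          # 20 sorted keys
phi = lambda x: x / 100.0           # stand-in for the trained model
errors = index_errors(X, phi)
e_max = max(errors)
```

The smaller the maximum error, the narrower the window, which is why the experiments report max e rather than only the mean error.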
PRIN17 - T1 - D1.1, preliminary results
Table: Hidden layers have 256 units. In parentheses, the best result with compression.

Model  Training time (s)  max e       $e_{\max}/|X|$       $e_{\mathrm{mean}}/|X|$

Data 1 - size $5.12 \cdot 10^2$
NN2    37                 4 (2)       0.78                 $1.83 \cdot 10^{-3}$ ($6.28 \cdot 10^{-4}$)
NN1    25                 3 (2)       0.78                 $1.60 \cdot 10^{-3}$ ($5.43 \cdot 10^{-4}$)
NN0    0.59               9 (8)       $1.7 \cdot 10^{-2}$  $4.74 \cdot 10^{-3}$ ($4.58 \cdot 10^{-3}$)

Data 2 - size $8.192 \cdot 10^3$
NN2    42.3               35 (32)     0.48                 $1.14 \cdot 10^{-3}$ ($8.18 \cdot 10^{-4}$)
NN1    35.8               33 (33)     0.4                  $1.12 \cdot 10^{-3}$ ($5.82 \cdot 10^{-4}$)
NN0    13.7               53 (43)     0.64                 $1.62 \cdot 10^{-3}$ ($1.52 \cdot 10^{-3}$)

Data 3 - size $1.048 \cdot 10^6$
NN2    470                1270 (660)  0.12                 $2.4 \cdot 10^{-4}$ ($1.45 \cdot 10^{-4}$)
NN1    438                1031 (292)  $9.0 \cdot 10^{-2}$  $2.33 \cdot 10^{-4}$ ($5.11 \cdot 10^{-5}$)
NN0    267                905 (706)   $8.6 \cdot 10^{-2}$  $1.85 \cdot 10^{-4}$ ($1.82 \cdot 10^{-4}$)
PRIN17 - T1 - D1.1, preliminary results 2
Extending the learned index in the sense of Kraska et al. 2017
Divide the input sequence into s splits and learn an NN0 model for each

Model  s  max e (one value per split)

Data 1 - size $5.12 \cdot 10^2$
NN2    1  4
NN1    1  3
NN0    1  9
NN0    2  6, 7
NN0    4  3, 5, 8, 7
NN0    8  3, 4, 6, 3, 3, 4, 3, 10

Data 2 - size $8.192 \cdot 10^3$
NN2    1  35
NN1    1  33
NN0    1  53
NN0    2  38, 22
NN0    4  25, 43, 156, 24
NN0    8  17, 25, 28, 27, 15, 20, 20, 19

Data 3 - size $1.048 \cdot 10^6$
NN2    1  1270
NN1    1  1031
NN0    1  905
NN0    2  615, 743
NN0    4  550, 625, 757, 516
NN0    8  568, 328, 299, 327, 229, 319, 296, 386

Best results (maximum value tested: s = 128, if possible)
Data 1, s = 20, $e_{\max} = 1$
Data 2, s = 72, $e_{\max} = 9$
Data 3, s = 112, $e_{\max} = 99$

Note: the larger s, the more space is occupied
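The effect of splitting can be illustrated with a deliberately simplified stand-in: instead of one NN0 per split, the sketch below fits a least-squares line per split and reports the maximum position error in each. The line fit, the function names and the synthetic keys are all assumptions made for illustration; they do not reproduce the experiments above.

```python
def fit_line(xs, ys):
    """Least-squares line y = a * x + b through the given points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def split_max_errors(X, s):
    """Max position error of a per-split linear model, one value per split."""
    n = len(X)
    bounds = [n * q // s for q in range(s + 1)]
    result = []
    for q in range(s):
        xs = X[bounds[q]:bounds[q + 1]]
        pos = list(range(1, len(xs) + 1))     # positions local to the split
        a, b = fit_line(xs, pos)
        result.append(max(abs(a * x + b - p) for x, p in zip(xs, pos)))
    return result

X = [i * i for i in range(1, 65)]   # synthetic, non-uniformly spaced keys
one_model = split_max_errors(X, 1)
eight_models = split_max_errors(X, 8)
```

Each split sees a narrower slice of the key distribution, so a simple model fits it better, at the price of storing s models instead of one.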
PRIN17 - Task 1 - D1.1: Ongoing research
The problem is that, in this setting, we aim at minimizing the maximum error, but the quadratic loss function is not a continuous approximation of the maximum function. When $e_i$ is the error over sample $x_i$ (d = 1), each weight $W_{jr}$ is updated by an amount proportional to
\[ \frac{\partial L_\Theta}{\partial W^{(H)}_{jr}} = \sum_{i=1}^{n} \frac{\partial L_\Theta}{\partial o^{(H)}_{ir}} \frac{\partial o^{(H)}_{ir}}{\partial W^{(H)}_{jr}} = \sum_{i=1}^{n} 2 e_i \frac{\partial o^{(H)}_{ir}}{\partial W^{(H)}_{jr}} \]

In principle, we need a loss function emphasizing the larger errors, while neglecting the smaller ones

There are some possible solutions for a loss function approximating the maximum function
\[ L_{\max}(e_1, \ldots, e_n) = \max_{i \in \{1, \ldots, n\}} e_i \]

We are now running some experiments using loss functions that approximate $L_{\max}$
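The slides do not say which approximations are being tested. Two standard smooth surrogates of the maximum, shown here purely as examples, are the scaled log-sum-exp and the p-norm; the parameter names `alpha` and `p` and the function names are assumptions.

```python
import math

def logsumexp_max(errors, alpha=10.0):
    """(1/alpha) * log(sum_i exp(alpha * e_i)), computed stably.

    Lies between max(errors) and max(errors) + log(n) / alpha.
    """
    m = max(errors)
    return m + math.log(sum(math.exp(alpha * (e - m)) for e in errors)) / alpha

def logsumexp_weights(errors, alpha=10.0):
    """Gradient of logsumexp_max with respect to each e_i (softmax weights)."""
    m = max(errors)
    z = [math.exp(alpha * (e - m)) for e in errors]
    total = sum(z)
    return [v / total for v in z]

def pnorm_max(errors, p=8):
    """(sum_i |e_i|^p)^(1/p), which tends to max |e_i| as p grows."""
    m = max(abs(e) for e in errors)
    if m == 0:
        return 0.0
    return m * sum((abs(e) / m) ** p for e in errors) ** (1.0 / p)

errors = [0.1, 0.5, 3.0]
lse = logsumexp_max(errors)
weights = logsumexp_weights(errors)
pn = pnorm_max(errors)
```

The softmax weights show the intended behaviour: almost all of the gradient goes to the sample with the largest error, whereas the quadratic loss weights each sample by $2 e_i$.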
PRIN17 - Task 1 - D1.1: Short- Mid-term objectives
[T1 - D1.1]. Short- Mid-term objectives
Once this investigation is concluded:

Currently we can relate the expected space occupancy to the compression rate, but we aim (if possible) at relating, at least empirically, the compression rate to the maximum error

Characterizing the effectiveness of NN compression techniques with different data distributions

Comparing with other ‘classical’ indexes (B-trees, B+ trees, etc.) (UNIPA is already working on that)