Keras’s Phylanx Backend
Bita Hasheminezhad
STE||AR GROUP
Outline
– Available Deep Learning Platforms
– What’s Special about Keras
– Keras Backends
– Inference Example 1; Multi-Class classification
– Inference Example 2; Sentiment Analysis
– Keras In Future
– Conclusion
Deep Learning Platforms
– Spark Apache
– DistBelief Google
– TensorFlow Google
– CNTK Microsoft
– Project Adam Microsoft
– MXNet Apache
– CoreML Apple
– Theano Universite de Montreal
– Caffe Berkeley AI Research
– Caffe2 Facebook
– PyTorch Facebook
– SINGA National University of Singapore
– Chainer Preferred Networks
Support Keras
Deep Learning Platforms
– Spark Apache
– DistBelief -> TensorFlow Google
– CNTK Microsoft
– Project Adam Microsoft
– MXNet Apache
– CoreML Apple
– Theano Universite de Montreal
– Caffe -> Caffe2 -> PyTorch Facebook
– SINGA National University of Singapore
– Chainer Preferred Networks
What is Keras?
– Keras is a high-level neural networks API, written in Python and capable of running on top of a deferred execution backend.
User friendly
Modular
Easily extensible
[1] https://app.dimensions.ai/discover/publication
Fig 1. Number of publications during the last decade having the name of the DL platform in their full text [1]
Deferred Style
Imperative or Eager Style
Deep Learning Platforms
– TensorFlow Google
– CNTK Microsoft
– Theano Universite de Montreal
– MXNet Apache
– CoreML Apple
– Caffe -> Caffe2 -> PyTorch Facebook
[2] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265-283).
– Deferred execution has two distinct phases: the first defines the program as a symbolic graph, and the second executes an optimized version of the program on the set of available devices [2].
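The two phases can be illustrated with a toy expression graph in plain Python. This is a minimal sketch only, not how TensorFlow or Phylanx build or optimize graphs; the names `Node`, `constant`, `placeholder`, `fold_constants`, and `run` are hypothetical and exist purely for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Tuple

OPS = {"add": operator.add, "mul": operator.mul}


@dataclass(frozen=True)
class Node:
    """One vertex of a symbolic expression graph (hypothetical toy type)."""
    op: str                          # "const", "input", "add" or "mul"
    args: Tuple["Node", ...] = ()
    value: float = 0.0               # used only by "const" nodes
    name: str = ""                   # used only by "input" nodes

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def constant(v: float) -> Node:
    return Node("const", value=v)


def placeholder(name: str) -> Node:
    return Node("input", name=name)


def fold_constants(node: Node) -> Node:
    """Optimization pass: pre-compute sub-graphs that involve only constants."""
    if node.op in ("const", "input"):
        return node
    args = tuple(fold_constants(a) for a in node.args)
    if all(a.op == "const" for a in args):
        return constant(OPS[node.op](*(a.value for a in args)))
    return Node(node.op, args)


def run(node: Node, feed: Dict[str, float]) -> float:
    """Execution phase: evaluate the graph for concrete input values."""
    if node.op == "const":
        return node.value
    if node.op == "input":
        return feed[node.name]
    return OPS[node.op](*(run(a, feed) for a in node.args))


# Phase 1: define the program symbolically; nothing is computed yet.
x = placeholder("x")
graph = x * (constant(2.0) + constant(3.0))

# Phase 2: optimize the graph once, then execute it for concrete inputs.
optimized = fold_constants(graph)
print(optimized.args[1])              # the (2 + 3) sub-graph is now one constant
print(run(optimized, {"x": 4.0}))
```

Because the whole program is known before it runs, a pass such as `fold_constants` can rewrite it once and reuse the result for every later execution.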
Keras different backends
Platform | Data Parallelism | Model Parallelism
TensorFlow | Synchronous or asynchronous through parameter servers | Supported using greedy heuristics
CNTK | Bounded asynchronous through a parameter server model |
Theano | Not on multiple nodes |
MXNet | Synchronous or asynchronous through parameter servers | Not on multiple nodes
Table 1. Investigating parallelism in the deep learning platforms supported by Keras
– “When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down.” [2]
The solution to the problem
[3] Zhang, Z., Yin, L., Peng, Y., & Li, D. (2018, December). A Quick Survey on Large Scale Distributed Deep Learning Systems. In 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS) (pp. 1052-1056). IEEE.
Problem:
On a single node with an NVIDIA M40 GPU, training ResNet50 on the ImageNet data set takes 14 days! [3]
Solution:
A High-Performance Keras Backend which
Is deferred style; can optimize the expression graph
Is distributed and can run on multiple nodes
Uses asynchronous computations; avoids the straggler problem
Let’s use HPX!
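To make the straggler point concrete, the sketch below uses Python's standard `concurrent.futures` module as a stand-in for futures-based asynchronous execution. It is an analogy only: HPX is a C++ runtime with its own API, and the `simulated_gradient` worker and its delays are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulated_gradient(worker_id: int, delay_s: float) -> int:
    """Hypothetical worker: sleeps to imitate compute time, returns its id."""
    time.sleep(delay_s)
    return worker_id


# Worker 0 is the straggler; the other workers finish quickly.
delays = {0: 0.30, 1: 0.01, 2: 0.05}

arrival_order = []
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    futures = [pool.submit(simulated_gradient, w, d) for w, d in delays.items()]
    # Consume each result as soon as it is ready instead of waiting at a
    # barrier for all workers; fast workers are not held back by worker 0.
    for fut in as_completed(futures):
        arrival_order.append(fut.result())

print("results consumed in order:", arrival_order)
```

With a synchronous barrier, no result could be used until the slowest worker had finished; here the fast results are available for further work while the straggler is still running.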
HPX as a backend for Keras
– Using hints from the user and the optimization step, the expression graph is passed to the HPX runtime, which schedules work and infers the data layout on each compute locale [4].
[4] http://phylanx.stellar-group.org/
Keras (Python)
Phylanx (Python Frontend, C++ Backend)
HPX (C++)
How to implement a Keras Backend
backend epsilon set_epsilon
floatx cast_to_floatx
set_floatx image_data_format
set_image_data_format
normalize_data_format
reset_uids get_uid
learning_phase set_learning_phase
is_sparse to_dense
variable eval
update update_add update_sub
is_keras_tensor is_tensor placeholder is_placeholder
moving_average_update identity
random_uniform_variable random_normal_variable
get_value set_value
name_scope
print_tensor function
gradients stop_gradient
in_test_phase in_train_phase
constant zeros ones eye zeros_like ones_like
max min sum prod cumsum cumprod argmax argmin var std mean square sqrt abs exp log logsumexp
dtype cast dot batch_dot transpose gather in_top_k
any all equal not_equal greater greater_equal less less_equal round sign pow clip maximum minimum permute_dimensions
shape int_shape ndim count_params reshape
cos sin concatenate stack repeat_elements repeat tile arange flatten expand_dims squeeze one_hot reverse
slice switch bias_add dropout l2_normalize
truncated_normal random_normal random_uniform random_binomial
resize_images resize_volumes
conv1d conv2d conv3d separable_conv1d separable_conv2d
local_conv1d local_conv2d
pool2d pool3d
temporal_padding spatial_2d_padding spatial_3d_padding conv2d_transpose conv3d_transpose depthwise_conv2d
map_fn foldl foldr
rnn ctc_decode ctc_label_dense_to_sparse
ctc_batch_cost
relu elu tanh softmax softplus softsign
sigmoid hard_sigmoid
binary_crossentropy categorical_crossentropy sparse_categorical_crossentropy
batch_get_value batch_set_value
normalize_batch_in_training batch_flatten batch_normalization
Keras related
main
Basic math
Convolutional
Recurrent
Batch related
Late Binding
Activations and Losses
Up to 4D
Inference
Not yet
Training
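At its core, a backend is a module that exposes these function names with the semantics Keras expects. The following is a minimal sketch of a handful of them on top of NumPy. It is not the Phylanx implementation, it covers only a few inference-style helpers, and the default values mirror the usual Keras defaults (`float32`, `1e-7`).

```python
import numpy as np

_FLOATX = "float32"
_EPSILON = 1e-7


def epsilon():
    """Return the fuzz factor used to avoid division by zero."""
    return _EPSILON


def set_epsilon(value):
    global _EPSILON
    _EPSILON = float(value)


def floatx():
    """Return the default float type as a string."""
    return _FLOATX


def cast_to_floatx(x):
    return np.asarray(x, dtype=_FLOATX)


def variable(value, dtype=None):
    return np.array(value, dtype=dtype or _FLOATX)


def get_value(x):
    return np.asarray(x)


def argmax(x, axis=-1):
    return np.argmax(x, axis=axis)


def equal(x, y):
    return np.equal(x, y)


def cast(x, dtype):
    return np.asarray(x).astype(dtype)


def expand_dims(x, axis=-1):
    return np.expand_dims(x, axis)


def int_shape(x):
    return tuple(np.shape(x))


# Usage: the same calls a Keras script would make, resolved by this toy module.
probs = variable([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = argmax(probs, axis=1)
hits = cast(equal(labels, np.array([1, 0, 0])), "int64")
print(get_value(labels), int(hits.sum()), int_shape(expand_dims(hits, 0)))
```

A real backend additionally has to provide placeholders, `function`, and `gradients`, which is where a deferred-execution runtime does most of its work.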
Inference Example 1; Multi-Class Classification
from keras import backend as K
from keras.datasets import mnist
from keras.utils import to_categorical
import numpy as np
import pandas as pd

(_, y_train), (x_test, y_test) = mnist.load_data()
num_classes = len(np.unique(y_train))

# convert class vectors to binary class matrices
print("y_train shape:", y_train.shape)
y_train = to_categorical(y_train, num_classes)
print("y_train shape, after one_hot encoding:", y_train.shape)

# precomputed class scores, one row per test image
df = pd.read_csv('class_pred.csv')
class_pred = df.values
print("class_predict shape:", class_pred.shape)
print("A sample of class_predict:", class_pred[0])

# the predicted label is the class with the highest score
labels_pred = K.argmax(class_pred, axis=1)
print("Predicted labels:", K.get_value(labels_pred))
print("What we have on y_test", K.eval(y_test))

# 1 where the prediction matches the true label, 0 otherwise
corrects = K.equal(labels_pred, y_test)
corrects = K.cast(corrects, 'int64')
print("Correct labels:", K.get_value(corrects))
number_of_corrects = K.get_value(K.sum(corrects))
print("Number of corrects predictions: %d "%number_of_corrects)
corrects = K.expand_dims(corrects, axis=0)
num_images = K.int_shape(corrects)[1]
print("Accuracy: %.2f%%" % ((number_of_corrects*100)/num_images))

# Misclassified: mask the labels so that only wrong predictions are non-zero
incorrects = K.not_equal(corrects, 1)
incorrects = K.eval(incorrects)
labels_error = (lambda x: x[0] * x[1])([K.eval(labels_pred), incorrects])
labels_true = (lambda x: x[0] * x[1])([K.eval(y_test), incorrects])
# inspect the first 500 test images only
labels_true_slice = K.slice(K.squeeze(K.variable(labels_true),0),[0],[500])
labels_error_slice = K.slice(K.flatten(K.variable(labels_error)),[0],[500])
for i,j in zip(K.get_value(labels_true_slice), K.get_value(labels_error_slice)):
    if i != 0:
        print("Label",i,"is misclassified as",j)

Output:
Using Phylanx backend.
y_train shape: (60000,)
y_train shape, after one_hot encoding: (60000, 10)
class_predict shape: (10000, 10)
A sample of class_predict: [3.27987540e-37 1.93442800e-25 5.78854500e-25 1.94946260e-21
3.15305600e-31 1.03375155e-32 0.00000000e+00 1.00000000e+00 4.98417950e-32 3.93246830e-21]
Predicted labels: [7 2 1 ... 4 5 6]
What we have on y_test [7 2 1 ... 4 5 6]
Correct labels: [1 1 1 ... 1 1 1]
Number of corrects predictions: 9837
Accuracy: 98.37%
Label 4 is misclassified as 2
Label 2 is misclassified as 7
Label 5 is misclassified as 3
Label 3 is misclassified as 7
Label 6 is misclassified as 0
Label 9 is misclassified as 3
Label 8 is misclassified as 2
Label 2 is misclassified as 7
Label 8 is misclassified as 4
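For readers who want to check the logic without the Phylanx backend, the same accuracy and misclassification report can be sketched in plain NumPy. The tiny `class_pred` and `y_test` arrays below are made-up stand-ins for the contents of `class_pred.csv` and the MNIST test labels.

```python
import numpy as np

# Made-up class scores for 5 samples and 3 classes (stand-in for class_pred.csv).
class_pred = np.array([
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
    [0.1, 0.3, 0.6],
])
y_test = np.array([1, 0, 2, 1, 0])        # made-up true labels

labels_pred = class_pred.argmax(axis=1)   # most likely class per row
corrects = labels_pred == y_test          # boolean mask of correct predictions
accuracy = 100.0 * corrects.sum() / corrects.size

print("Predicted labels:", labels_pred)
print("Accuracy: %.2f%%" % accuracy)

# Index the wrong predictions directly with the inverted mask.
for true, pred in zip(y_test[~corrects], labels_pred[~corrects]):
    print("Label", true, "is misclassified as", pred)
```

One design difference is worth noting: selecting with a boolean mask also reports samples whose true label is 0, whereas multiplying the labels by a mask and skipping zeros, as the slide code does, cannot distinguish a true label of 0 from a correct prediction.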
Inference Example 2; Sentiment Analysis
from keras import backend as K
from keras.datasets import imdb
from phylanx import Phylanx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy functions wrapped as lazy Phylanx functions
@Phylanx
def unique_eager(x):
    return np.unique(x)
unique = Phylanx.lazy(unique_eager)

@Phylanx
def argsort_eager(x):
    return np.argsort(x)
argsort = Phylanx.lazy(argsort_eager)

@Phylanx
def where_eager(x):
    return np.where(x)
where = Phylanx.lazy(where_eager)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
classes = unique(K.variable(y_test))
print("classes:", K.get_value(classes))
largest_index = y_test.size - 1

# precomputed sentiment scores, one per test review
df = pd.read_csv('labels_pred.csv')
labels_pred = K.squeeze(df.values, 1)
print("Predicted labels:", K.get_value(labels_pred))
plt.hist(K.get_value(labels_pred))
plt.show()

# a prediction is correct when it lies within 0.5 of the true label
corrects = K.less(K.abs(labels_pred - K.variable(y_test)), .5)
print("Accuracy:", K.get_value(K.sum(K.cast(corrects,"int32")))*100/
      K.int_shape(corrects)[0])
y_true = K.equal(y_test, 1.)

# sort scores and corresponding truth values
indices = argsort(labels_pred)
desc_score_indices = K.eval(K.reverse(indices, 0))
print("desc_score_indices", desc_score_indices)
y_score = K.gather(labels_pred, desc_score_indices)
y_true = K.gather(y_true, desc_score_indices)
y_true = K.cast(y_true, "int64")
print("y_true", K.get_value(y_true))
print("y_score", K.get_value(y_score))

# indices where the sorted score changes, plus the last index
diff = np.diff(K.eval(y_score))
distinct_value_indices = where(K.not_equal(diff, 0))
distinct_value_indices = K.get_value(distinct_value_indices)[0]
print("distinct_value_indices", distinct_value_indices)
threshold_idxs = K.eval(K.concatenate(
    [K.variable(distinct_value_indices), K.variable(np.array([largest_index]))], 0))

# accumulate the true positives with decreasing threshold
tps = K.get_value(K.gather(K.cumsum(y_true), threshold_idxs))
print("True Positives:", tps)
fps = 1 + threshold_idxs - tps
print("False Positives:", fps)
thresholds = K.get_value(K.gather(y_score, threshold_idxs))
print("Decreasing Threshold:", thresholds)
# plot_roc_curve is a plotting helper that is not shown on the slide
plot_roc_curve(tps, fps, thresholds)

Output:
Using Phylanx backend.
classes: [0 1]
Predicted labels: [9.2614290e-03 9.9999920e-01 9.9997926e-01
... 6.2763690e-05 3.3009052e-03 6.0482204e-01]
Accuracy: 86.704
desc_score_indices [12420 1594 2351 ... 11280 13389 18853]
y_true [1 1 1 ... 0 0 0]
y_score [1. 1. 1. ... 0. 0. 0.]
distinct_value_indices [ 484 593 981 ... 23598 23794 23973]
True Positives: [ 483 591 975 ... 12493 12495 12500]
False Positives: [ 2 3 7 ... 11302 11479 12500]
Decreasing Threshold: [1.0000000e+00 9.9999994e-01
9.9999990e-01 ... 5.9604645e-08 2.9802322e-08
0.0000000e+00]
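Confusion matrix (rows: actual class, columns: predicted class)
 | Predicted P | Predicted N
Actual P | TP | FN
Actual N | FP | TN
TPR = TP/P, FPR = FP/N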
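As a worked illustration of TPR = TP/P and FPR = FP/N, the sketch below computes the ROC points for a tiny hand-made example in NumPy. It follows the usual sort, cumulative-sum, and threshold-index steps; the scores and labels are invented for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])                # invented labels
y_score = np.array([0.9, 0.8, 0.8, 0.4, 0.3, 0.1])   # invented scores

# Sort by decreasing score (a stable sort keeps ties in input order).
order = np.argsort(-y_score, kind="stable")
y_true, y_score = y_true[order], y_score[order]

# Keep the last index of every run of equal scores, plus the final index.
distinct = np.where(np.diff(y_score) != 0)[0]
threshold_idxs = np.concatenate([distinct, [y_true.size - 1]])

tps = np.cumsum(y_true)[threshold_idxs]   # true positives at each threshold
fps = 1 + threshold_idxs - tps            # samples predicted positive minus hits

P = y_true.sum()                          # number of actual positives
N = y_true.size - P                       # number of actual negatives
tpr, fpr = tps / P, fps / N

for t, a, b in zip(y_score[threshold_idxs], tpr, fpr):
    print("threshold %.1f  TPR %.2f  FPR %.2f" % (t, a, b))
```

At threshold index `i`, exactly `i + 1` samples are predicted positive, which is why the false positives are `1 + threshold_idxs - tps`.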
Keras in future
[5] https://github.com/keras-team/keras/releases
TensorFlow Eager (2.0)
[6] https://www.tensorflow.org/guide/effective_tf2
Deferred Style
Imperative or Eager Style
– TensorFlow 1.0
– CNTK
– Theano
– MXNet
– CoreML
– PyTorch
TensorFlow 2.0
Performance of TF Eager
– “We expect most real-world models to fall somewhere between these two, and to be able to recover performance by staging as required.” [7]
– “TensorFlow Eager is an evolving technology and closing the gap between imperative and staged performance is being worked on.” [7]
[7] Agrawal, A., Modi, A. N., Passos, A., Lavoie, A., Agarwal, A., Shankar, A., ... & Cai, S. (2019). Tensorflow eager: A multi-stage, python-embedded dsl for machine learning. arXiv preprint arXiv:1903.01855.
Fig 6. Examples per second training ResNet-50 on a GPU
Fig 7. Examples per second training L2HMC on a CPU
Where we are now
– We have made good progress on the Phylanx backend of Keras
Many of the needed primitives are implemented in Phylanx [8]
BlazeTensor has acceptable support for 3D and 4D arrays [9]
– We need higher dimensionalities because DL platforms usually add batch and channel dimensions to the data dimensions.
– “As one part of the development of TensorFlow, our team has extended the open source Eigen library with support for arbitrary dimensionality tensor operations.” [10]
The majority of Keras backend tests pass [11]
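The dimensionality argument can be seen from array shapes alone. In the NumPy sketch below, a single 28x28 grayscale image is 2D, but once a channel axis and a batch axis are added, the tensor a backend must handle is 4D. The channels-last layout and the batch size of 32 are assumptions chosen for illustration.

```python
import numpy as np

image = np.zeros((28, 28), dtype="float32")      # one grayscale image: 2D

with_channel = image[..., np.newaxis]            # (height, width, channels)
batch = np.stack([with_channel] * 32, axis=0)    # (batch, height, width, channels)

print(image.ndim, with_channel.shape, batch.shape)
```

Volumetric or video data adds one more axis on top of this, which is why support beyond 4D eventually matters as well.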
[8] https://github.com/STEllAR-GROUP/phylanx
[9] https://github.com/STEllAR-GROUP/blaze_tensor
[10] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[11] https://github.com/STEllAR-GROUP/keras
Thank you for your attention