CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries

Lin Tan (3), Thibaud Lutellier (1), Weizhen Qi (2), Hung Viet Pham (1)

(1) University of Waterloo, Canada
(2) University of Science and Technology of China, China
(3) Purdue University, USA



Deep learning (DL) is pervasive

● Machine translation
● Alzheimer’s disease diagnosis
● Autonomous driving cars
● Virtual assistance


Correct DL systems require correct implementations

[Figure: a DL system consists of algorithms/models and their implementations; CRADLE focuses on the implementations.]


DL libraries are hard to test and debug

● Intrinsic complexity
● The expected output of a DL system is unknown
    ○ Correct programs should produce the expected output.
    ○ The ground truth is not the expected output because models are not perfect.

Example (MobileNetV2):
    ○ Ground truth: banana
    ○ MobileNetV2 on TensorFlow: banana
    ○ MobileNetV2 expected output: tennis ball


Idea: Differential testing

[Figure: the same InceptionResNetV2 model classifies a “petri-dish” image on the TensorFlow backend and on the CNTK backend. The TensorFlow classification and the CNTK classification differ, which is an inconsistency.]


Batch_normalization bug

● The CNTK batch normalization formula was implemented incorrectly.
● The developers fixed the bug after we reported it.

- return (x - mean) / (C.sqrt(var) + epsilon) * gamma + beta
+ return (x - mean) / C.sqrt(var + epsilon) * gamma + beta
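To see why the position of epsilon matters, the two formulas from the fix can be compared numerically. The following is a minimal scalar sketch, not the CNTK backend code: it swaps the tensor operation `C.sqrt` for `math.sqrt`, and the input values and the default `epsilon=1e-3` are hypothetical choices made for illustration.

```python
import math


def batch_norm_before_fix(x, mean, var, gamma, beta, epsilon=1e-3):
    # Formula before the fix: epsilon is added outside the square root.
    return (x - mean) / (math.sqrt(var) + epsilon) * gamma + beta


def batch_norm_after_fix(x, mean, var, gamma, beta, epsilon=1e-3):
    # Formula after the fix: epsilon is added to the variance inside the square root.
    return (x - mean) / math.sqrt(var + epsilon) * gamma + beta


# Hypothetical values; a small variance makes the gap between the formulas visible.
x, mean, var, gamma, beta = 1.0, 0.0, 1e-4, 1.0, 0.0
before = batch_norm_before_fix(x, mean, var, gamma, beta)
after = batch_norm_after_fix(x, mean, var, gamma, beta)
print(f"before fix: {before:.3f}, after fix: {after:.3f}")
```

With these values the two results differ by roughly a factor of three, so a small local difference of this kind can plausibly change the final classification once it propagates through many layers.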


Differential testing: Challenges

● How to compare two implementations?
    ○ What metric to use?
    ○ What should be considered bugs?
● How to localize the faults?
    ○ How to find faults in the complex model executions?


Differential testing: Ideas

● Two metrics measure the severity of the inconsistency for a set of input instances.

● Localization map compares intermediate states of DL models for fault localization.


CRADLE: Overview

[Figure: CRADLE overview. Input: trained models and validation data. Detection phase: the output extractor produces the model output and reports crash bugs; the output comparator reports inconsistency bugs and unique inconsistencies. Localization phase: the hidden states extractor produces hidden states; the inconsistency localizer produces localization maps.]


CRADLE: Detection phase

[Figure: CRADLE overview diagram, repeated to introduce the detection phase (output extractor and output comparator).]


Output extractor

● Executes the models on different backends to obtain their output
● Detects crashes

[Figure: the InceptionResNetV2 model runs on the CNTK backend and produces the CNTK classification of a “petri-dish” image.]


Output comparator: Distance metrics

● CLASS-based (classification)
● MAD-based (regression)

Both metrics calculate the difference between two backends relative to the ground truth.
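The slide names the MAD-based metric without giving its formula. The sketch below is one plausible reading and should be treated as an assumption: each backend's output for one input is scored by its mean absolute deviation (MAD) from the ground truth, and the distance is the difference of the two scores normalized by their sum, which keeps it in [0, 1]. The function names are hypothetical.

```python
def mad(output, truth):
    """Mean absolute deviation between one output vector and the ground truth."""
    return sum(abs(o - t) for o, t in zip(output, truth)) / len(output)


def mad_distance(output_a, output_b, truth):
    """Assumed MAD-based distance between two backends for one input.

    The normalization by the sum of the two deviations is an assumption;
    it is not stated on the slide.
    """
    dev_a = mad(output_a, truth)
    dev_b = mad(output_b, truth)
    if dev_a + dev_b == 0:
        return 0.0  # both backends match the ground truth exactly
    return abs(dev_a - dev_b) / (dev_a + dev_b)


# Hypothetical regression outputs from two backends for the same input.
print(mad_distance([1.0, 2.0], [1.0, 4.0], [1.0, 2.0]))
```

Under this reading, two backends that are equally far from the ground truth get a distance of 0 even if both are wrong, which matches the idea of measuring the difference relative to the ground truth.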


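CLASS-based distance example

Top-5 classification of the “petri-dish” image:

● TensorFlow: $\mathrm{Rank}_{\text{petri-dish},TF} = 1$, so $\sigma_{\text{petri-dish},TF} = 2^{5-1} = 16$
● CNTK: $\mathrm{Rank}_{\text{petri-dish},CN} > 5$, so $\sigma_{\text{petri-dish},CN} = 0$
● Distance: $|\sigma_{\text{petri-dish},TF} - \sigma_{\text{petri-dish},CN}| = 16$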
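The worked example above can be expressed as a short sketch. It assumes the score is 2^(k - rank) when the ground-truth label appears in the top k predictions and 0 otherwise, which is what the numbers on the slide imply for k = 5; the function names are hypothetical.

```python
def class_score(rank, k=5):
    """Score of the ground-truth label given its rank in the prediction.

    Assumed form: 2 ** (k - rank) if the label is within the top k, else 0.
    """
    return 2 ** (k - rank) if rank <= k else 0


def class_distance(rank_a, rank_b, k=5):
    """CLASS-based distance between two backends for one input."""
    return abs(class_score(rank_a, k) - class_score(rank_b, k))


# "petri-dish" image: rank 1 on TensorFlow, outside the top 5 on CNTK.
print(class_distance(1, 6))
```

Because the score halves with each rank, a disagreement near the top of the ranking weighs far more than one near the bottom of the top 5.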


Inconsistency-triggering input (ITI)

● An ITI is an input instance that triggers a distance larger than a threshold (TC for the CLASS-based metric, TM for the MAD-based metric).
    ○ E.g., the “petri-dish” image is an ITI given TC = 8.

Example ITIs (one image per line):
    ○ Theano: Indian elephant; TensorFlow: groom; CNTK: groom
    ○ TensorFlow: banana; CNTK: tennis ball; Theano: tennis ball
    ○ CNTK: Arabian camel; TensorFlow: hen; Theano: hen


Detect inconsistency

● An inconsistency is a pair of implementations for which more than p% of the inputs in the validation set are ITIs.

[Figure: InceptionResNetV2 on TensorFlow vs. InceptionResNetV2 on CNTK over the validation set, with example D_CLASS values of 16, 6, and 0, TC = 8, and p = 10%.]
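The detection rule can be sketched as follows, using the three example distances from the figure (16, 6, 0) as a toy validation set. The function names are hypothetical, and a real validation set would contain far more inputs.

```python
def iti_rate(distances, threshold):
    """Fraction of inputs whose distance exceeds the threshold (i.e., ITIs)."""
    if not distances:
        return 0.0
    return sum(1 for d in distances if d > threshold) / len(distances)


def is_inconsistency(distances, threshold, p):
    """True if more than p percent of the inputs are ITIs."""
    return iti_rate(distances, threshold) * 100 > p


# Toy validation set with the D_CLASS values from the figure, TC = 8, p = 10%.
d_class = [16, 6, 0]
print(iti_rate(d_class, threshold=8), is_inconsistency(d_class, threshold=8, p=10))
```

Only the first input exceeds TC = 8, so one third of the toy set is inconsistency-triggering, which is above p = 10%. The comparison is strict, so with p = 0% a single ITI is enough to flag the pair.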


CRADLE: Localization phase

[Figure: CRADLE overview diagram, repeated to introduce the localization phase (hidden states extractor and inconsistency localizer).]


Hidden state extractor

● The “most inconsistent” input per inconsistency is used.
● The network structure and the hidden states together are considered as the network execution graph.
● Hidden states are the outputs of hidden layers.

[Figure: InceptionResNetV2 execution graph on TensorFlow. Input: jean → Conv2D → BatchNorm → (776 layers omitted) → Activation → GloAvgPool → Dense → TensorFlow: jean.]


MAD differences

[Figure: the same input (jean) is run through the InceptionResNetV2 execution graph on both backends. TensorFlow predicts “jean”; CNTK predicts “mail bag”. The MAD difference 𝛿 between corresponding hidden states is shown per layer.]

● Conv2D: 𝛿 = 0.0
● BatchNorm: 𝛿 = 0.0002
● (776 layers omitted)
● Activation: 𝛿 = 0.1480
● GloAvgPool: 𝛿 = 0.0860
● Dense: 𝛿 = 0.0004
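A per-layer 𝛿 of this kind can be sketched as the mean absolute difference between the hidden states that two backends produce for the same layer. This is an assumption about how 𝛿 is computed, since the slide only labels it a MAD difference; the nested-list representation of a hidden state and the function names are hypothetical stand-ins for real tensors.

```python
def flatten(state):
    """Yield the scalar values of a nested list or tuple in order."""
    if isinstance(state, (list, tuple)):
        for item in state:
            yield from flatten(item)
    else:
        yield state


def hidden_state_mad(state_a, state_b):
    """Mean absolute difference between two hidden states of the same shape."""
    values_a = list(flatten(state_a))
    values_b = list(flatten(state_b))
    if len(values_a) != len(values_b):
        raise ValueError("hidden states must have the same number of elements")
    return sum(abs(a - b) for a, b in zip(values_a, values_b)) / len(values_a)


# Hypothetical 2x2 hidden states of one layer on two backends.
tf_state = [[1.0, 2.0], [3.0, 4.0]]
cntk_state = [[1.0, 2.0], [3.0, 4.4]]
print(hidden_state_mad(tf_state, cntk_state))
```

Computing this value for every layer of the execution graph gives the sequence of 𝛿 values that the localization step works with.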


Inconsistency introduction rate

● Calculate the rate of change R of the MAD difference 𝛿 from one layer to the next
    ○ ε prevents division by zero
● Highlight executions with R above the third quartile

InceptionResNetV2 localization map between TensorFlow and CNTK (Input: jean; TensorFlow: jean; CNTK: mailbag):

● Conv2D: 𝛿 = 0.0, R = 0.0
● BatchNorm: 𝛿 = 0.0002, R = 2048.6
● Activation: 𝛿 < 0.0001, R = -0.5497
● Conv2D: 𝛿 = 0.0003, R = 2.3009
● (772 layers omitted)
● Conv2D: 𝛿 = 0.0138, R = 0.5530
● BatchNorm: 𝛿 = 0.3067, R = 21.186
● Activation: 𝛿 = 0.1480, R = -0.5173
● GloAvgPool: 𝛿 = 0.0860, R = -0.4191
● Dense: 𝛿 = 0.0004, R = -0.9950
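The formula for R is not preserved on the slide. The sketch below assumes R = (𝛿 - 𝛿_pre) / (𝛿_pre + ε), where 𝛿_pre is the MAD difference of the preceding layer; this form is consistent with the values in the localization map (for example, Dense: (0.0004 - 0.0860) / 0.0860 ≈ -0.995). The value ε = 1e-7, the linear chain of layers, the zero difference at the shared input, and the function names are all assumptions.

```python
import statistics

EPSILON = 1e-7  # assumed value; the slide only says epsilon prevents division by zero


def introduction_rates(deltas, epsilon=EPSILON):
    """Rate of change of the MAD difference along a linear chain of layers."""
    rates = []
    previous = 0.0  # assumption: both backends receive the identical input
    for delta in deltas:
        rates.append((delta - previous) / (previous + epsilon))
        previous = delta
    return rates


def highlight(rates):
    """Flag layers whose rate is above the third quartile of all rates."""
    third_quartile = statistics.quantiles(rates, n=4)[2]
    return [rate > third_quartile for rate in rates]


# Illustrative deltas for Conv2D, BatchNorm, Activation, Conv2D; the slide
# rounds its values, so the second Conv2D rate does not match the map exactly.
deltas = [0.0, 0.0002, 0.00009, 0.0003]
rates = introduction_rates(deltas)
print(rates)
print(highlight(rates))
```

In this toy chain only the BatchNorm layer is flagged: its 𝛿 is tiny in absolute terms, but it is the first layer where a difference appears, which is what the rate of change is designed to expose.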


Result

● 3 backends, 28 models, 11 datasets
● 104 unique inconsistencies
● 7 inconsistency bugs
● 5 crash bugs


7 inconsistency bugs

● Batch normalization (BatchNormalization)
● Padding scheme (Conv2D variant)
● Pooling scheme (AveragePooling2D)
● Parameter organization (Trainable Conv)


Localization is helpful

● Relevant to the causes of all 104 unique inconsistencies

[Figure: chart with the categories “First”, “One of”, and “Relevant”.]


Conclusion

● CRADLE applies differential testing to DL implementations and localizes faulty functions by tracking error propagation.
    ○ Detects 7 confirmed inconsistency bugs and 5 crash bugs
    ○ Helps find the root causes of all 104 unique inconsistencies using localization maps
● Inconsistencies are common and widespread.
● We call for more attention to the testing of DL libraries.


DL system overview

[Figure: layered DL system. User code sits on top of high-level libraries (Keras, the interface), which run on low-level libraries (the backends: TensorFlow, Theano, CNTK, ...), which run on hardware (CPU, GPU).]


Group unique inconsistency

● A unique inconsistency is a group of inconsistencies with the same inconsistency pattern between the same pair of implementations.
    ○ The inconsistency pattern is the distribution of the metric distances.


Suggested settings

● Grid search on TC, TM, and p values
● Optimal settings (most inconsistencies without false negatives and false positives) are:
    ○ CLASS-based: TC = 8 and p = 0%
    ○ MAD-based: TM = 0.2 and p = 0%
● Confirm using cross-validation


Dataset and hardware

● Dataset:
    ○ 11 datasets including ImageNet, MNIST, Udacity Driving Challenge 2, etc.
    ○ 30 pre-trained models
● Hardware:
    ○ Xeon E5-2695
    ○ NVIDIA Titan Xp


Detected inconsistencies

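[Table: detected inconsistencies; the table contents are not preserved.] The numbers outside brackets are the numbers of unique inconsistencies; the numbers inside brackets are the total numbers of inconsistencies.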


Comparison to accuracy

● Detect an inconsistency if the top-k accuracy difference is above a threshold TAC
● We pick k between 1 and 5 and TAC between 0 and 50
● With TAC = 0, top-1 accuracy detects the most inconsistencies (305) but still misses 35
    ○ E.g., for the Dog species model, the Batch_normalization bugs induce inconsistency between TensorFlow and CNTK
    ○ However, those backends get identical top-1 (29.9%) and top-5 (64.4%) accuracies
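A toy illustration of why an accuracy comparison can miss inconsistencies: two backends may be correct on different inputs and still end up with the same accuracy. The labels and predictions below are made up for this sketch and do not come from the Dog species model.

```python
def top1_accuracy(predictions, labels):
    """Fraction of inputs whose top-1 prediction equals the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def disagreements(predictions_a, predictions_b):
    """Number of inputs on which the two backends predict different classes."""
    return sum(a != b for a, b in zip(predictions_a, predictions_b))


# Hypothetical labels and per-backend top-1 predictions.
labels = ["dog", "cat", "bird", "fish"]
backend_a = ["dog", "cat", "tree", "rock"]   # correct on inputs 0 and 1
backend_b = ["car", "boat", "bird", "fish"]  # correct on inputs 2 and 3

print(top1_accuracy(backend_a, labels), top1_accuracy(backend_b, labels))
print(disagreements(backend_a, backend_b))
```

Both backends reach 50% top-1 accuracy, so the accuracy difference is 0 and no threshold TAC would flag the pair, yet they disagree on every input. A per-input distance metric sees these disagreements directly.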


Future work

● Detect inconsistencies and bugs in training code
    ○ Harder since training is non-deterministic
● Generate mutated models using fuzzing to expand the testing set
● Test with only one backend using equivalent models