35
Leveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures Joy Arulraj, Guoliang Jin and Shan Lu

Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Leveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Joy Arulraj, Guoliang Jin and Shan Lu

Page 2: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Production-Run Failure Diagnosis

•  Goal – Figure out root cause of failure on client machines – Fix them quickly

2

Page 3: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Importance

•  Social and financial impact – Toyota Prius software glitch – NASDAQ Facebook IPO glitch

3

Page 4: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Challenges

•  Limited program execution information – Performance and privacy reasons

•  Complicated root cause – Sequential bugs – Concurrency bugs

•  Need to diagnose and fix quickly

4

Page 5: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Existing Tools

(1)  During entire execution (2) At failure-site

(3) Sampling

5

Page 6: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Limitations Of Existing Tools

Diagnosis Latency

Performance Overhead

Low

High

Low High

(1) Entire-execution approach

(2) Failure-site approach

6

Page 7: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Limitations Of Existing Tools

Diagnosis Latency

Performance Overhead

Low

High

Low High

(1) Entire-execution approach

(2) Failure-site approach

(3) Sampling

1/100 sampling rate è ~100 failures required for diagnosis

7

Page 8: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Challenge

•  Low performance overhead – Collect little execution information

•  Low diagnosis latency – Collect root-cause related information

Which part of the program execution is most likely to contain root-cause information?

8

Page 9: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Our Solution: Last Execution Record

•  Execution right before failure – Last Execution Record (LXR)

•  How to collect this information efficiently? – Leverage simple hardware support

Last Execution Record (LXR)

program execution

9

Page 10: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Last Execution Record (LXR)

Diagnosis Latency

Performance Overhead

Low

High

Low High

(1) Entire-execution approach

(2) Failure-site approach

(3) Sampling

Last Execution Record

10

Page 11: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Outline

•  LXR Design •  Failure diagnosis using LXR •  Evaluation

11

Page 12: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

LXR Design Questions & Principles

•  What should we collect in LXR? – Useful for failure diagnosis

•  How to collect LXR? – Lightweight to collect

12

Page 13: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

LXR Design For Sequential Bugs

•  What should we collect in LXR? – Recently taken branches

•  How to collect LXR? – Use existing hardware support -- LBR

13

Page 14: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Last Branch Record (LBR)

•  Existing hardware feature – Set of recently taken branches – Circular buffer with 16 entries (Intel Nehalem) – Overhead: negligible

Branch Source Instruction Pointer  

Branch Target Instruction Pointer  

Lightweight 14

Page 15: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Is LBR useful ?

•  Root-cause of many types of sequential bugs [PLDI.2005]

•  Error-propagation distances tend to be short [DSN.2003]

Branches Abnormal

Control Flow

Semantic Bugs

Memory Bugs

reflect

Useful 15

Page 16: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

•  Coreutils: sort -m -o file1 file1

// SORT.C void merge (...) { … open_input_files(...); }

int open_input_files (...) { if (files[i].pid != 0) tableèbucket = val; else … }

int open_input_files (...) { if (files[i].pid != 0) /* child process */ tableèbucket = val; else … }

CALL STACK

open_input_files()

merge()

main() failure

16

Sequential Bug Example

Page 17: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

•  Coreutils: sort -m -o file1 file1 // SORT.C void merge (...) { avoid_trashing_input(…); … open_input_files(...); }

int avoid_trashing_input (...) { if(…) { int num_merged = 0; while (i + num_merged < nfiles) { num_merged += mergefiles(...); memmove(&files[i], &files[i+num_merged],); } } }

int open_input_files (...) { if (files[i].pid != 0) tableèbucket = val; else … }

int open_input_files (...) { if (files[i].pid != 0) tableèbucket = val; else … }

failure

LBR if (files[i].pid != 0) ...

while (i+num_merged < nfiles)

Is LBR useful ?

17

Page 18: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

LXR Design For Concurrency Bugs

•  What should we collect in LXR? – Recently executed cache-access instructions – Cache-coherence state observed (M/E/S/I)

•  How to maintain and collect LXR? – Key hardware feature already exists – Propose a simple hardware extension to use that

18

Page 19: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Last Cache-coherence Record (LCR)

•  Existing hardware feature – Configurable cache-coherence event counting

•  Extension: – Buffer to collect this information – Set of recent L1 data cache access instructions

•  Overhead: not perceivable

Cache-access Instruction Pointer

Cache-coherence State (M/E/S/I)

Lightweight 19

Page 20: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Is LCR Useful ?

•  Related to concurrency bug root-causes

[ASPLOS.2013] •  Error-propagation distances tend to be short

[ASPLOS.2011]

Cache-coherence Information

Abnormal Interleavings

Atomicity-Violation Bugs

Order-Violation Bugs

reflects

20 Useful

Page 21: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

•  Mozilla JavaScript Engine InitState(...){

table = New();

if (table == NULL) { ReportOutOfMemory();

return JS_FALSE;

} }

ReportOutOfMemory(){ error("out of memory"); }

failure

CALL STACK

ReportOutofMemory()

InitState()

main()

Is LCR useful ?

21

Page 22: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

InitState(...){

table = New();

if (table == NULL) { ReportOutOfMemory();

return JS_FALSE;

} } ReportOutOfMemory(){ error("out of memory"); }

FreeState(...){

Destroy(table);

table = NULL; }

Thread 2 Thread 1 Invalid Modified Invalid Exclusive Invalid Modified

Is LCR useful ?

Modified

W

R W

R

•  Mozilla JavaScript Engine: Success Run

22

Page 23: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

•  Mozilla JavaScript Engine: Failure Run

InitState(...){

table = New();

if (table == NULL) { ReportOutOfMemory();

return JS_FALSE;

} } ReportOutOfMemory(){ error("out of memory"); }

FreeState(...){

Destroy(table);

table = NULL; }

Thread 2 Thread 1

failure

Invalid Shared Invalid Modified Modified Shared Invalid

Is LCR useful ?

Invalid

LCR table == NULL Invalid

table = New() Invalid

… …

W

R W

R

23

Page 24: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Outline

•  LXR Design •  Failure diagnosis using LXR •  Evaluation

24

Page 25: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Manual failure diagnosis

•  Enhance logging by collecting LXR – Existing failure logging functions – Signal handler

// failure logging function error_wrapper(args){

DISABLE_LXR(); PROFILE_LXR(); error (args);

}

// signal handler void handler(int signo) { DISABLE_LXR(); PROFILE_LXR(); … }

25

Page 26: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Automated failure diagnosis

•  Collect LXR in both failure and success runs •  Statistical analysis – Automatically identify failure predictors

LCR AUTOMATED Score table == NULL 0.91

table = New() 0.56

… …

LCR table == NULL Invalid

table = New() Invalid

… …

LCR table == NULL Modified

table = New() Invalid

… …

✔ ✖

26

Page 27: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Implementation details

•  LBR exposed via Linux kernel module – Enable, configure and access using our interface – Reducing LBR pollution from irrelevant branches – Details in paper

•  LCR simulated using PIN infrastructure – L1 data cache with MESI coherence protocol –  Interface similar to LBR – Details in paper

27

Page 28: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Outline

•  LXR Design •  Failure diagnosis using LXR •  Evaluation

28

Page 29: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Methodology

•  31 real-world failures –  In open-source server, client, utility programs – 20 sequential and 11 concurrency bugs

•  Compared against CBI/CCI – State-of-the-art software-based tools – Diagnose sequential and concurrency bugs – Perform sampling to lower overhead

29

Page 30: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Does LBR help locate root cause ? APPLICATION LBR

MANUAL (Nth entry)

LBR AUTOMATED

(Nth entry)

CBI (Nth entry)

Apache 3 1 2

Squid 2 1 -

Coreutils 12 1 2

Tar 4 1 1

PBZIP 4 1 -

•  Root-cause branch mostly in recent 8 LBR entries •  So, even short-term LBR memory sufficient !

30

Page 31: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Cross-checking LBR with patches

•  LBR entries are much closer to patch for most bugs

APPLICATION FAILURE SITE TO PATCH (LoC)

LBR TO PATCH (LoC)

Apache Another file 3

Squid 123 2

Coreutils 309 0

Tar Another file 2

PBZIP 41 1

31 Patch LBR Failure

Page 32: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Diagnosis Latency

TOOL DIAGNOSIS LATENCY

SAMPLING

Manual LBR 1 failure run No

Automated LBR 10 failure runs No

CBI 1000 failure runs Yes

•  LBR tools need fewer failure runs for diagnosis •  CBI uses sampling which increases latency

32

Page 33: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Performance Overhead

0.1

1

10

100

Apache Squid Coreutils Tar PBZIP

Perf

orm

ance

Ove

rhea

d (%

)

Manual LBR Automated LBR CBI

33

Page 34: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Does LCR help locate root cause ? APPLICATION LCR

MANUAL (Nth entry)

LCR AUTOMATED

(Nth entry)

CCI (Nth entry)

Apache 5 1 1

Cherokee - - -

Mozilla 8 1 1

MySQL 9 1 -

PBZIP 7 1 1

•  Locates root-cause in 7 out of 11 failures •  So, short-term LCR memory sufficient

34

Page 35: Leveraging the Short-Term Memory of Hardware to Diagnose ...jarulraj/talks/2014.lbr.asplos.pdfLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures

Summary

Diagnosis Latency

Performance Overhead

Low

High

Low High

(1) Entire-execution approach

(2) Failure-site approach

(3) Sampling

Last Execution Record

35