
Automated Debugging: Are We There Yet?


DESCRIPTION

This is an updated version of the opening talk I gave on February 4, 2013 at the Dagstuhl Seminar on Fault Prediction, Localization, and Repair (http://goo.gl/vc6Cl). Abstract: Software debugging, which involves localizing, understanding, and removing the cause of a failure, is a notoriously difficult, extremely time-consuming, and human-intensive activity. For this reason, researchers have invested a great deal of effort in developing automated techniques and tools for supporting various debugging tasks. Although potentially useful, most of these techniques have yet to fully demonstrate their practical effectiveness. Moreover, many current debugging approaches suffer from some common limitations and rely on several strong assumptions about both the characteristics of the code being debugged and how developers behave when debugging such code. This talk provides an overview of the state of the art in the broader area of software debugging, discusses strengths and weaknesses of the main existing debugging techniques, presents a set of open challenges in this area, and sketches future research directions that may help address these challenges. The second part of the talk discusses joint work with Chris Parnin presented at ISSTA 2011: C. Parnin and A. Orso, "Are Automated Debugging Techniques Actually Helping Programmers?", in Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2011), pp. 199-209, http://doi.acm.org/10.1145/2001420.2001445


Page 1: Automated Debugging: Are We There Yet?

Automated Debugging: Are We There Yet?

Alessandro (Alex) Orso
School of Computer Science – College of Computing
Georgia Institute of Technology
http://www.cc.gatech.edu/~orso/

Partially supported by: NSF, IBM, and MSR


Page 5: Automated Debugging: Are We There Yet?

How are we doing?

Are we there yet?

Where shall we go next?

Page 6: Automated Debugging: Are We There Yet?

How Are We Doing?
A Short History of Debugging

Page 7: Automated Debugging: Are We There Yet?

The Birth of Debugging

First reference to software errors. Your guess?

(Timeline: ???–2013)

Page 8: Automated Debugging: Are We There Yet?

The Birth of Debugging

• 1843: Software errors mentioned in Ada Byron's notes on Charles Babbage's Analytical Engine

• 1940s: Several uses of the term "bug" to indicate defects in computers and software

• 1947: First actual bug and actual debugging (Admiral Grace Hopper's associates working on the Mark II computer at Harvard University)

(Timeline: 1840–2013)

Page 11: Automated Debugging: Are We There Yet?

Symbolic Debugging

• 1962: UNIVAC 1100's FLIT (Fault Location by Interpretive Testing)

• 1986: Richard Stallman's GDB

• 1996: DDD

• ...

(Timeline: 1840–2013)

Page 15: Automated Debugging: Are We There Yet?

Program Slicing

• Intuition: developers "slice" backwards when debugging

• 1981: Weiser's breakthrough paper

• 1988: Korel and Laski's dynamic slicing

• 1993: Agrawal's dynamic slicing

• 2008: Ko's Whyline

(Timeline: 1960–2013)

Page 17: Automated Debugging: Are We There Yet?

Static Slicing Example

mid() {
      int x, y, z, m;
  1:  read("Enter 3 numbers:", x, y, z);
  2:  m = z;
  3:  if (y < z)
  4:    if (x < y)
  5:      m = y;
  6:    else if (x < z)
  7:      m = y; // bug (should be: m = x)
  8:  else
  9:    if (x > y)
 10:      m = y;
 11:    else if (x > z)
 12:      m = x;
 13:  print("Middle number is:", m);
}
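To make the intuition concrete, here is a minimal backward-slicing sketch in Python. The dependence edges below are hand-derived from the mid() example above (data dependences on the read at line 1 plus control dependences); a real slicer would compute them via static analysis.

# Backward static slicing over a hand-built dependence graph for mid().
# deps[s] = the statements that statement s depends on (data + control).
deps = {
    2: {1},                 # m = z            uses z from line 1
    3: {1},                 # if (y < z)       uses y, z
    4: {1, 3},              # if (x < y)       control-dependent on line 3
    5: {1, 4},              # m = y
    6: {1, 4},              # else if (x < z)
    7: {1, 6},              # m = y  (the bug)
    9: {1, 3},              # if (x > y)
    10: {1, 9},             # m = y
    11: {1, 9},             # else if (x > z)
    12: {1, 11},            # m = x
    13: {2, 5, 7, 10, 12},  # print(m): all reaching definitions of m
}

def backward_slice(criterion):
    """Return every statement that may affect the slicing criterion."""
    result, worklist = set(), [criterion]
    while worklist:
        s = worklist.pop()
        if s not in result:
            result.add(s)
            worklist.extend(deps.get(s, ()))
    return sorted(result)

print(backward_slice(13))  # -> [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13]

The static slice of m at line 13 contains nearly the whole program, which illustrates why purely static slices are often too large to be useful for debugging, and why dynamic slicing (next slides) restricts attention to a single execution.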


Page 20: Automated Debugging: Are We There Yet?

Dynamic Slicing Example

[Figure: the same mid() code as above, shown next to six test cases: (3,3,5), (1,2,3), (3,2,1), (5,5,5), (5,3,4), (2,1,3), with per-test statement coverage and outcomes P, P, P, P, P, F. The failing run (2,1,3) executes only statements 1, 2, 3, 4, 6, 7, and 13, so the dynamic slice is computed over that execution alone.]



Page 25: Automated Debugging: Are We There Yet?

Delta Debugging

• Intuition: it's all about differences!

• Isolates failure causes automatically

• Zeller's "Yesterday, My Program Worked. Today, It Does Not. Why?"

• Applied in several contexts

(Timeline: 1960–2013, marker at 1999)

Page 27: Automated Debugging: Are We There Yet?

[Figure: delta debugging repeatedly splits the set of changes between yesterday's passing version and today's failing version (✘), discarding the parts that do not affect the failure until only the failure cause remains. Applied to programs, inputs, states, ...]
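The core minimization algorithm behind this figure is Zeller's ddmin. Below is a minimal sketch of a simplified variant (it only tests complements, and it assumes a deterministic, user-supplied test oracle); the parser_crashes function in the usage note is hypothetical.

# Simplified sketch of ddmin: shrink a failure-inducing input to a
# smaller input that still fails. test(candidate) must return True
# iff the candidate input still triggers the failure.
def ddmin(failing_input, test):
    n = 2  # current granularity: number of chunks to split into
    while len(failing_input) >= 2:
        chunk = len(failing_input) // n
        subsets = [failing_input[i:i + chunk]
                   for i in range(0, len(failing_input), chunk)]
        for i in range(len(subsets)):
            # Try removing one chunk; keep the complement if it still fails.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if complement and test(complement):
                failing_input = complement
                n = max(n - 1, 2)
                break
        else:
            if n >= len(failing_input):  # finest granularity reached
                break
            n = min(n * 2, len(failing_input))  # refine and retry
    return failing_input

# Hypothetical usage with a crashing parser:
#   minimal = ddmin(list(bad_input), lambda s: parser_crashes("".join(s)))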

Page 33: Automated Debugging: Are We There Yet?

Statistical Debugging

• Intuition: debugging techniques can leverage multiple executions

• 2001: Tarantula

• 2003: Liblit's CBI

• 2006: Ochiai

• 2010: Causal inference based

• Many others!

Workflow integration: Tarantula, GZoltar, EzUnit, ...

(Timeline: 1960–2013)

Page 35: Automated Debugging: Are We There Yet?

Tarantula

[Figure: the mid() code, the six test cases (3,3,5), (1,2,3), (3,2,1), (5,5,5), (5,3,4), (2,1,3) with their statement coverage and outcomes P, P, P, P, P, F, and a suspiciousness score per statement. The faulty statement 7 receives the highest score (0.8); statements 1, 2, 3, and 13 get 0.5, statement 4 gets 0.6, statement 6 gets 0.7, and statements covered only by passing runs get 0.0.]
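For reference, here is a minimal sketch of how such scores are computed. The Tarantula and Ochiai formulas are the standard ones; the coverage counts below are read off the example's coverage matrix (5 passing tests, 1 failing test).

import math

def tarantula(p_s, f_s, total_p, total_f):
    # %failed / (%failed + %passed), where %x is the fraction of
    # x-outcome tests that cover statement s.
    fail = f_s / total_f if total_f else 0.0
    pas = p_s / total_p if total_p else 0.0
    return fail / (fail + pas) if (fail + pas) > 0 else 0.0

def ochiai(p_s, f_s, total_f):
    denom = math.sqrt(total_f * (f_s + p_s))
    return f_s / denom if denom else 0.0

# coverage[stmt] = (# passing tests covering stmt, # failing tests covering stmt)
coverage = {1: (5, 1), 4: (3, 1), 6: (2, 1), 7: (1, 1), 13: (5, 1)}
P, F = 5, 1
for s in sorted(coverage, key=lambda s: -tarantula(*coverage[s], P, F)):
    print(s, round(tarantula(*coverage[s], P, F), 2))
# -> statement 7 ranks first with 0.83; statements 1 and 13 score 0.5.

Ochiai differs only in the formula; on the same data it also ranks statement 7 first (0.71 vs. 0.41 for the always-executed statements).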


Page 41: Automated Debugging: Are We There Yet?

Formula-based Debugging (AKA Failure Explanation)

• Intuition: executions can be expressed as formulas that we can reason about

• 2009: Darwin

• 2011: Bug Assist (Cause Clue Clauses)

• Error invariants

• Angelic debugging

(Timeline: 1960–2013)

Page 42: Automated Debugging: Are We There Yet?

[Figure: a failing execution with input I and violated assertion A is encoded as the trace formula

Input = I ∧ c1 ∧ c2 ∧ c3 ∧ ... ∧ cn-2 ∧ cn-1 ∧ cn ∧ A

which is unsatisfiable by construction (steps: 1. build the formula, 2. solve MAX-SAT, 3. take the complement). MAX-SAT finds a maximal satisfiable subset of the constraints; the complement { ci } identifies the clauses, and hence the statements, that are potential causes of the failure.]
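As a concrete illustration, here is a minimal sketch of this idea using Z3's Optimize engine (a MaxSMT solver) in Python. The two-statement trace is a toy, hand-built encoding of the failing mid() run on input (2, 1, 3); real tools such as Bug Assist generate these constraints automatically from the program.

from z3 import Int, Optimize, is_true, sat

x, y, z = Int('x'), Int('y'), Int('z')
m1, m2 = Int('m1'), Int('m2')  # SSA versions of m (after line 2, after line 7)

opt = Optimize()
opt.add(x == 2, y == 1, z == 3)  # hard: the failing input I
opt.add(m2 == x)                 # hard: assertion A (expected output is the middle value, x)

# Soft constraints: one clause per executed statement of the trace.
trace = [("line 2: m = z", m1 == z),
         ("line 7: m = y", m2 == y)]  # line 7 is the bug
for _, clause in trace:
    opt.add_soft(clause, weight=1)

assert opt.check() == sat
model = opt.model()
for label, clause in trace:
    if not is_true(model.evaluate(clause)):
        print("potential cause:", label)  # -> line 7: m = y

MAX-SAT must drop the clause for line 7 to satisfy the assertion, so that statement is reported as a potential cause, matching the figure above.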


Page 47: Automated Debugging: Are We There Yet?

Additional Techniques

• Contracts (e.g., Meyer et al.)

• Counterexample-based (e.g., Groce et al., Ball et al.)

• Tainting-based (e.g., Leek et al.)

• Debugging of field failures (e.g., Jin et al.)

• Predicate switching (e.g., Zhang et al.)

• Fault localization for multiple faults (e.g., Steimann et al.)

• Debugging of concurrency failures (e.g., Park et al.)

• Automated data structure repair (e.g., Rinard et al.)

• Finding patches with genetic programming

• Domain-specific fixes (tests, web pages, comments, concurrency)

• Identifying workarounds/recovery strategies (e.g., Gorla et al.)

• Formula based debugging (e.g., Jose et al., Ermis et al.)

• ...


Not meant to be comprehensive!

Page 48: Automated Debugging: Are We There Yet?

Are We There Yet?
Can We Debug at the Push of a Button?

Page 49: Automated Debugging: Are We There Yet?

Automated Debugging (rank-based)

[Figure: the tool hands the developer a ranked list of statements: "Here is a list of places to check out." The developer: "Ok, I will check out your suggestions one by one."]

Page 50: Automated Debugging: Are We There Yet?

Automated Debugging: Conceptual Model

[Figure: the developer walks down the ranked list, checking entries off one by one (✔✔✔) until: "Found the bug!"]

Page 51: Automated Debugging: Are We There Yet?

Performance of Automated Debugging Techniques

[Figure: for the Siemens and Space subjects, the percentage of faulty versions (y-axis, 0–100) plotted against the percentage of the program that must be examined to find the fault (x-axis, 0–100).]

Page 52: Automated Debugging: Are We There Yet?

Mission Accomplished?

Best result: fault in 10% of the code. Great, but...

100 LOC ➡ 10 LOC
10,000 LOC ➡ 1,000 LOC
100,000 LOC ➡ 10,000 LOC

Moreover, strong assumptions

Page 54: Automated Debugging: Are We There Yet?

Assumption #1: Programmers exhibit perfect bug understanding

Do you see a bug?

Page 55: Automated Debugging: Are We There Yet?

Assumption #2: Programmers inspect a list linearly and exhaustively

Good for comparison, but is it realistic?

Does the conceptual model make sense? Have we really evaluated it?

Page 57: Automated Debugging: Are We There Yet?

Where Shall We Go Next?
Are We Headed in the Right Direction?

AKA: "Are Automated Debugging Techniques Actually Helping Programmers?" (ISSTA 2011, Chris Parnin and Alessandro Orso)

Page 58: Automated Debugging: Are We There Yet?

What do we know about automated debugging?

Studies on tools vs. human studies

Let's see... Over 50 years of research on automated debugging:
1962. Symbolic Debugging (UNIVAC FLIT)
1981. Weiser. Program Slicing
1999. Delta Debugging
2001. Statistical Debugging

Page 60: Automated Debugging: Are We There Yet?

What do we know about automated debugging?

Human studies: Weiser, Kusumoto, Sherwood, Ko, DeLine

Page 61: Automated Debugging: Are We There Yet?

Are these Techniques and Tools Actually Helping Programmers?

• What if we gave developers a ranked list of statements?

• How would they use it?

• Would they easily see the bug in the list?

• Would ranking make a difference?

Page 62: Automated Debugging: Are We There Yet?

Hypotheses

H1: Programmers who use automated debugging tools will locate bugs faster than programmers who do not use such tools

H2: Effectiveness of automated tools increases with the level of difficulty of the debugging task

H3: Effectiveness of debugging with automated tools is affected by the faulty statement’s rank

Page 63: Automated Debugging: Are We There Yet?

Research Questions

RQ1: How do developers navigate a list of statements ranked by suspiciousness? In order of suspiciousness, or jumping from one statement to another?

RQ2: Does perfect bug understanding exist? How much effort is involved in inspecting and assessing potentially faulty statements?

RQ3: What are the challenges involved in using automated debugging tools effectively? Can unexpected, emerging strategies be observed?

Page 64: Automated Debugging: Are We There Yet?

Experimental Protocol: Setup

Participants: 34 developers (MS students) with different levels of expertise (low, medium, high)

Page 65: Automated Debugging: Are We There Yet?

Experimental Protocol: Setup

Tools:
• Rank-based tool (Eclipse plug-in, logging)
• Eclipse debugger

Page 66: Automated Debugging: Are We There Yet?

Experimental Protocol: Setup

Software subjects:
• Tetris (~2.5 KLOC)
• NanoXML (~4.5 KLOC)

Page 67: Automated Debugging: Are We There Yet?

Tetris Bug

(Easier)

Page 68: Automated Debugging: Are We There Yet?

NanoXML Bug

(Harder)


Page 70: Automated Debugging: Are We There Yet?

Experimental Protocol: Setup

Tasks:
• Fault in Tetris
• Fault in NanoXML
• 30 minutes per task
• Questionnaire at the end

Page 71: Automated Debugging: Are We There Yet?

Experimental Protocol: Studies and Groups

Study 1: groups A and B (with and without the rank-based tool)

Study 2: groups C and D, with the rank of the faulty statement artificially moved (rank 7 ➡ 35 for one task, rank 83 ➡ 16 for the other)

Page 74: Automated Debugging: Are We There Yet?

Study Results (after stratifying participants and analyzing results and questionnaires)

          | Tetris                                       | NanoXML
A vs. B   | Significantly different for high performers | Not significantly different
C vs. D   | Not significantly different                 | Not significantly different

Page 79: Automated Debugging: Are We There Yet?

Findings: Hypotheses

H1: Programmers who use automated debugging tools will locate bugs faster than programmers who do not use such tools

Experts are faster when using the tool ➡ Support for H1 (with caveats)

H2: Effectiveness of automated tools increases with the level of difficulty of the debugging task

The tool did not help harder tasks ➡ No support for H2

H3: Effectiveness of debugging with automated tools is affected by the faulty statement’s rank

Changes in rank have no significant effects ➡ No support for H3

Page 80: Automated Debugging: Are We There Yet?

Findings: RQs

RQ1: How do developers navigate a list of statements ranked by suspiciousness? In order of suspiciousness, or jumping between statements?
➡ Programmers do not visit each statement in the list; they search.

RQ2: Does perfect bug understanding exist? How much effort is involved in inspecting and assessing potentially faulty statements?
➡ Perfect bug understanding is generally not a realistic assumption.

RQ3: What are the challenges involved in using automated debugging tools effectively? Can unexpected, emerging strategies be observed?
➡ 1) The statements in the list were sometimes useful as starting points.
➡ 2) (Tetris) Several participants preferred to search based on intuition.
➡ 3) (NanoXML) Several participants gave up on the tool after investigating too many false positives.

Page 81: Automated Debugging: Are We There Yet?

Research Implications

• Percentages will not cut it (e.g., 1.8% == 83rd position)
➡ Implication 1: Techniques should focus on improving absolute rank rather than percentage rank

• Ranking can be successfully combined with search
➡ Implication 2: Future tools may focus on searching through (or automatically highlighting) certain suspicious statements

• Developers want explanations, not recommendations
➡ Implication 3: We should move away from pure ranking and define techniques that provide context and ability to explore

• We must grow the ecosystem
➡ Implication 4: We should aim to create an ecosystem that provides the entire tool chain for fault localization, including managing and orchestrating test cases

Page 82: Automated Debugging: Are We There Yet?

In Summary

• We came a long way since the early days of debugging

• There is still a long way to go...

Page 83: Automated Debugging: Are We There Yet?

Where Shall We Go Next

• Hybrid, semi-automated fault localization techniques
• Debugging of field failures (with limited information)
• Failure understanding and explanation
• (Semi-)automated repair and workarounds
• User studies, user studies, user studies! (true also for other areas)

Page 84: Automated Debugging: Are We There Yet?

With much appreciated input/contributions from

• Andy Ko
• Wei Jin
• Jim Jones
• Wes Masri
• Chris Parnin
• Abhik Roychoudhury
• Wes Weimer
• Tao Xie
• Andreas Zeller
• Xiangyu Zhang