Sahara: Guiding the Debugging of Failed Software Upgrades

Rekha Bachwani, Olivier Crameri, Ricardo Bianchini, Dejan Kostic, Willy Zwaenepoel


Presentation slides for my ICSM talk.


Page 1: Sahara icsm 2011

Sahara: Guiding the Debugging of Failed Software Upgrades

Rekha Bachwani, Olivier Crameri, Ricardo Bianchini, Dejan Kostic, Willy Zwaenepoel

Page 2: Sahara icsm 2011

Motivation

Modern software is complex and requires regular updates (once every few weeks)

Fix bugs

Patch security vulnerabilities

Software upgrade failures are frequent [SOSP'07]

5-10% of all upgrades fail

Upgrade failures can be catastrophic

Service disruption ($$)

User dissatisfaction

ICSM, 2011 Rekha Bachwani, Rutgers University 2

Page 3: Sahara icsm 2011

Motivation

Differences between vendor and user environments are a major source of failures [SOSP'07]

Broken dependencies

Incompatibilities with legacy systems

Testing in all user environments is impractical

The set of all possible environment settings is large

The set of possible user inputs is huge

Debugging software upgrade failures is hard

Incomplete user environment data

Unable to reproduce user conditions or failures

Page 4: Sahara icsm 2011

Approach

Integrate users in the testing environment

Test the upgrade in (many) user environments with their input

Collect data from (willing) users

Environment settings

Success or failure flags

Leverage data from the users to isolate the cause

Page 5: Sahara icsm 2011

Contributions

Sahara: Upgrade Debugging System

Simplifies debugging of environment-related failures

Prioritizes the set of routines to consider when debugging

Uses machine learning, and static and dynamic analyses

Evaluate Sahara with 3 applications (5 failures)

Three real upgrade failures in OpenSSH

One synthetic failure each in SQLite and uServer

Page 6: Sahara icsm 2011

Outline

Overview

Sahara: Debugging Failed Upgrades

Evaluation

Conclusion

Page 7: Sahara icsm 2011

Sahara - Key Idea

Upgrade failures are caused by user environments

Identify the suspect environment resources (SERs)

Identify the code affected by the SERs

Software behaved correctly before the upgrade

Identify the code deviations in the upgrade

The root cause is most likely in the code that is both affected by the suspect aspects of the environment and has deviated after the upgrade
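The two conditions above suggest a simple set intersection. A minimal sketch of that idea (the routine names are invented for illustration, not taken from the talk):

```python
# Sketch: prime suspects = routines affected by suspect environment
# resources (SERs) AND routines that deviated after the upgrade.
# All routine names below are hypothetical.

def prime_suspects(affected_by_sers, deviated_after_upgrade):
    """Intersect the two candidate sets to narrow the search."""
    return affected_by_sers & deviated_after_upgrade

affected = {"channel_setup", "packet_send", "auth_check"}
deviated = {"channel_setup", "session_close"}

print(prime_suspects(affected, deviated))  # {'channel_setup'}
```

Either set alone is large; their intersection is what makes the recommendation small.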

Page 8: Sahara icsm 2011

Sahara - Identifying Suspects

[Diagram: the vendor distributes the upgrade to many user sites; each site applies the upgrade and tests it.]

Page 9: Sahara icsm 2011

Sahara - Identifying Suspects

[Diagram: user sites report pass/fail labels and environment data back to the vendor; feature selection over these profiles yields the suspect environment resources (SERs), and static analysis maps the SERs to suspect routines.]
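The talk does not spell out the feature-selection algorithm; as one illustrative possibility (feature names, values, and scoring all invented), environment features can be ranked by how exclusively their values appear in failing profiles:

```python
# Sketch: rank environment features by how strongly their values
# separate failing from passing profiles. This is a simple
# correlation-style score, not necessarily the algorithm Sahara uses.

def rank_features(profiles, labels):
    """profiles: list of dicts (feature -> value); labels: 'pass'/'fail'."""
    scores = {}
    for f in set().union(*profiles):
        fail_vals = {p.get(f) for p, l in zip(profiles, labels) if l == "fail"}
        pass_vals = {p.get(f) for p, l in zip(profiles, labels) if l == "pass"}
        # Values seen only in failing profiles are suspicious.
        only_fail = fail_vals - pass_vals
        scores[f] = len(only_fail) / max(len(fail_vals), 1)
    return sorted(scores, key=scores.get, reverse=True)

profiles = [
    {"window_size": "2MB", "port_fwd": "yes"},    # failed
    {"window_size": "128KB", "port_fwd": "yes"},  # passed
    {"window_size": "128KB", "port_fwd": "no"},   # passed
]
labels = ["fail", "pass", "pass"]
print(rank_features(profiles, labels)[0])  # window_size
```

The top-ranked features become the SERs that static analysis then maps to code.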

Page 10: Sahara icsm 2011

Sahara – Identifying Deviations

[Diagram: starting from the suspect routines, the vendor produces instrumented code containing dynamic-analysis code; both the vendor and the user sites run the original and new versions of the instrumented code.]

Page 11: Sahara icsm 2011

Sahara – Identifying Deviations

[Diagram: dynamic analysis compares the runs of the original and new versions at the vendor and the user sites to produce the set of deviated routines.]

Page 12: Sahara icsm 2011

Sahara – Identifying Deviations

[Diagram: at the vendor, Suspect Routines ∩ Deviated Routines = Prime Suspects.]

Page 13: Sahara icsm 2011

Sahara - Summary

Identifies environment resources that caused the failure

Feature selection using feedback from many users

Isolates routines affected by the suspect environment

Def-use static analysis

Finds routines that have deviated after the upgrade

Dynamic source analysis [ICSM'04]

Combines results from static and dynamic analysis to produce the prime suspects
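The def-use step can be pictured as forward propagation: start from routines that read a suspect resource, then follow value flows to routines that consume what they define. A toy sketch under that assumption (all routine and resource names invented):

```python
# Toy def-use propagation: a routine is a suspect if it reads a suspect
# environment resource, or consumes a value derived from one.
# This simplifies real def-use analysis to a reachability walk.

def suspect_routines(reads_env, flows, sers):
    """reads_env: routine -> env resources it reads
       flows:     routine -> routines that consume values it defines
       sers:      suspect environment resources"""
    frontier = [r for r, res in reads_env.items() if res & sers]
    suspects = set(frontier)
    while frontier:
        r = frontier.pop()
        for consumer in flows.get(r, ()):
            if consumer not in suspects:
                suspects.add(consumer)
                frontier.append(consumer)
    return suspects

reads_env = {"parse_config": {"window_size"}, "auth_check": {"use_pam"}}
flows = {"parse_config": {"channel_setup"}, "channel_setup": {"send_packet"}}
print(sorted(suspect_routines(reads_env, flows, {"window_size"})))
# ['channel_setup', 'parse_config', 'send_packet']
```

Routines that never touch a suspect resource (here, auth_check) are pruned from the search.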

Page 14: Sahara icsm 2011

Outline

Overview

Sahara: Debugging Failed Upgrades

Evaluation

Conclusion

Page 15: Sahara icsm 2011

Experimental Setup

Upgrade deployment: environment data from 87 machines

Experiments

Application-specific configuration

Random

Modified 3 out of 8 real configurations to induce failures

Feature selection: 20 fail profiles, 67 success profiles

Failure correlation

Perfect (100%) – all failure-inducing profiles result in failure

Imperfect (60%) – 60% of failure-inducing profiles result in failure

Imperfect (20%) – 20% of failure-inducing profiles result in failure

Suspects: features within 30% of the top-ranked feature
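The last line can be read as a relative cutoff: keep every feature whose score is within 30% of the best score. A sketch with invented feature names and scores:

```python
# Sketch: keep features whose score is within a fraction of the top
# score (the talk uses 30%). Feature names and scores are made up.

def select_suspects(scores, slack=0.30):
    top = max(scores.values())
    return {f for f, s in scores.items() if s >= (1 - slack) * top}

scores = {"window_size": 0.9, "port_fwd": 0.7, "compression": 0.2}
print(sorted(select_suspects(scores)))  # ['port_fwd', 'window_size']
```

Widening the slack (as in the sensitivity analysis later) admits more SERs and therefore more prime suspects.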

Page 16: Sahara icsm 2011

Evaluation

Evaluate Sahara with three applications:

OpenSSH – 3 real upgrade failures

Upgrades every 3-6 months

50-70K lines of code

SQLite – 1 synthetic upgrade failure

67K lines of code

uServer – 1 synthetic upgrade failure

37K lines of code

Results for only two OpenSSH bugs are discussed next

Page 17: Sahara icsm 2011

OpenSSH bugs – Port Forwarding

Large data transfers abort with port forwarding

Regression bug in ssh version 4.7

Abort not reproducible at the vendor site

Reasons for the abort:

Users with port forwarding enabled issued large transfers

Default window size increased from 128KB to 2MB

Window size incorrectly advertised as packet size

sshd limits the maximum packet size to 256KB

Page 18: Sahara icsm 2011

Results – Port Forwarding

Sahara reduces the no. of routines by 2-3x over static analysis, 17-20x over dynamic analysis, and 9-10x over diff

Produces a small number of routines

Prime suspects always include the offending routine(s)

[Chart reconstructed as a table; no. of routines per configuration, columns: diff | Suspect Routines (static analysis) | Deviated Routines (dynamic analysis) | Prime Suspects (Sahara)]

Perfect (100%), SERs = 1:    65 | 12 | 124 | 6
Imperfect (60%), SERs = 1:   65 | 12 | 124 | 6
Imperfect (20%), SERs = 1:   65 | 12 | 124 | 6
Perfect (100%), SERs = 3:    65 | 22 | 124 | 7
Imperfect (60%), SERs = 3:   65 | 22 | 124 | 7
Imperfect (20%), SERs = 3:   65 | 22 | 124 | 7

Page 19: Sahara icsm 2011

OpenSSH bugs – X11 Forwarding

X forwarding won't start when executed in the background

Regression bug in sshd version 4.2

X11 forwarding is enabled and the X session is started in the background

Reasons for the failure:

X11 code was modified to destroy listeners whose session has ended

With the X11 session in the background, sshd closes the session

Page 20: Sahara icsm 2011

Results – X11 Forwarding

Sahara reduces the no. of routines by 3x over static analysis, 20x over dynamic analysis, and 15x over diff

Produces a small number of routines

Offending routine(s) always included in the prime suspects

[Chart reconstructed as a table; no. of routines per configuration, columns: diff | Suspect Routines (static analysis) | Deviated Routines (dynamic analysis) | Prime Suspects (Sahara)]

Perfect (100%), SERs = 1:    137 | 18 | 157 | 6
Imperfect (60%), SERs = 1:   137 | 18 | 157 | 6
Imperfect (20%), SERs = 1:   137 | 18 | 157 | 6
Perfect (100%), SERs = 3:    137 | 21 | 157 | 7
Imperfect (60%), SERs = 3:   137 | 20 | 157 | 6
Imperfect (20%), SERs = 3:   137 | 20 | 157 | 6

Page 21: Sahara icsm 2011

Results – Sensitivity Analysis (1/2)

Impact of the number of failure-inducing profiles

Default – 20 failure-inducing profiles

Case 1 – 30 failure-inducing profiles

Number of SERs reduces by at most 2 features

Number of prime suspects reduces by at most 1

Case 2 – 10 failure-inducing profiles

Number of SERs reduces by at most 1

Number of prime suspects reduces by at most 1

More profiles result in fewer SERs and prime suspects

Fewer profiles sometimes result in less noise

Page 22: Sahara icsm 2011

Results – Sensitivity Analysis (2/2)

Impact of feature selection accuracy

Default – suspects within 30% of the top-ranked feature

Case 1 – suspects within 50% of the top-ranked feature

Prime suspects increase by at most 2x

Case 2 – all configuration parameters are suspect

Prime suspects increase by 6-7x

Lower feature selection accuracy results in more SERs and prime suspects

Page 23: Sahara icsm 2011

Conclusion

Sahara leverages user feedback, machine learning, and program analyses

Produces accurate recommendations with a small set of routines

The recommended set always includes the offending routine and the culprit environment resource

Demonstrates that combining different techniques can be effective for debugging

Page 24: Sahara icsm 2011

Thanks for your time!

Questions?