A Case for an Open Source Data Repository Archana Ganapathi Department of EECS, UC Berkeley ([email protected] )


Page 1: A Case for an Open Source Data Repository

A Case for an Open Source Data Repository

Archana Ganapathi
Department of EECS, UC Berkeley
([email protected])

Page 2: A Case for an Open Source Data Repository

Why do we study failure data?

Understand the cause->effect relationship between configurations and system behavior
Still don’t have a complete understanding of failures in systems
Can’t worry about fixing problems if we don’t understand them in the first place
Gauge behavioral changes over time
Need realistic workload/faultload data to test/evaluate systems
Success stories… people have benefited from failure data analysis

Page 3: A Case for an Open Source Data Repository

Crash data collection success stories

Berkeley EECS
BOINC
2 unnamed companies

Page 4: A Case for an Open Source Data Repository

…So Why Does Windows Crash?

Page 5: A Case for an Open Source Data Repository

Definitions

Crash
  Event caused by a problem in the operating system (OS) or application (app).
  Requires an OS or app restart.
Application Crash
  A crash occurring at user level, caused by one or more components (.exe/.dll files).
  Requires an application restart.
Application Hang
  An application crash that results from the user terminating a process that is potentially deadlocked or running an infinite loop.
  Component (.exe/.dll file routing) causing the loop/deadlock cannot be identified (yet).
OS Crash
  A crash occurring at kernel level, caused by memory corruption, bad drivers, or faulty system-level routines.
  Blue-screen-generating crashes require a machine reboot.
  Windows Explorer crashes require restarting the explorer process.
Bluescreen
  An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.

Page 6: A Case for an Open Source Data Repository

Procedure

Collect crash dumps from two different sources
  UC Berkeley EECS department
  BOINC volunteers
Filter data/form crash clusters to avoid double-counting
  Account for shared resources, dependent processes, system instability, user retry
Parse/interpret crash dumps using Debugging Tools for Windows
Study both application crash behavior and operating system crashes
Supplement crash data with usage data
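The filtering step can be illustrated with a small sketch. It is not the procedure used in the study: the record fields (`machine`, `app`, `time`) and the 10-minute window are assumptions made for the example, and it only handles the user-retry case, not shared resources or dependent processes.

```python
from datetime import datetime, timedelta

# Hypothetical window: crashes of the same app on the same machine that
# follow the previous one within this interval count as one incident.
WINDOW = timedelta(minutes=10)

def cluster_crashes(crashes, window=WINDOW):
    """Group crash records into clusters so repeated crashes from a
    user retrying the same action are counted once.

    `crashes` is an iterable of dicts with the hypothetical keys
    'machine', 'app', and 'time' (a datetime).
    """
    clusters = []
    last_seen = {}  # (machine, app) -> index of that key's latest cluster
    for crash in sorted(crashes, key=lambda c: c["time"]):
        key = (crash["machine"], crash["app"])
        idx = last_seen.get(key)
        if idx is not None and crash["time"] - clusters[idx][-1]["time"] <= window:
            clusters[idx].append(crash)      # same incident: extend the cluster
        else:
            clusters.append([crash])         # new incident
            last_seen[key] = len(clusters) - 1
    return clusters

t0 = datetime(2004, 7, 1, 9, 0)
sample = [
    {"machine": "m1", "app": "iexplore.exe", "time": t0},
    {"machine": "m1", "app": "iexplore.exe", "time": t0 + timedelta(minutes=2)},
    {"machine": "m1", "app": "iexplore.exe", "time": t0 + timedelta(hours=3)},
    {"machine": "m2", "app": "iexplore.exe", "time": t0 + timedelta(minutes=1)},
]
clusters = cluster_crashes(sample)
print(len(clusters))  # number of distinct incidents after clustering
```

Counting clusters instead of raw dumps keeps one impatient user from dominating the statistics; the right window size would have to be tuned against real data.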

Page 7: A Case for an Open Source Data Repository

EECS Dataset

Type of User        Number of Users   Number of Crashes
graduate student    30                621
staff               28                414
unknown             16                197
faculty             14                191
undergraduate       4                 19
visitor             3                 51
guest               1                 9
postdoc             1                 19
TOTAL               97                1521

Page 8: A Case for an Open Source Data Repository

Crashes reported per month

[Figure: bar chart "Number of Crashes per Month". x-axis: Month (eleven bars, Jun 14-30, 2004 through Apr 1-14, 2005); y-axis: # crashes (0 to 350). Bar labels in extraction order: 54, 204, 248, 282, 320, 220, 184, 201, 191, 237, 113.]

Page 9: A Case for an Open Source Data Repository

Usage/Crashes per day of week

[Figure: bar chart "Percentage of Computer Users per Day of Week". x-axis: Day of Week (Monday through Sunday); y-axis: % users (0 to 120).]

[Figure: bar chart "Number of Crashes per Day of Week". x-axis: Day of Week (Monday through Sunday); y-axis: # crashes (0 to 500). Bar labels in extraction order: 446, 371, 482, 410, 297, 116, 132.]

EECS department users use their EECS computers Monday through Friday.

Few users use computers on weekends.

Crashes do not occur uniformly across the five days of the working week.

Page 10: A Case for an Open Source Data Repository

Usage/Crashes per hour of day

[Figure: bar chart "Percentage of Computer Users per Hour of Day". x-axis: Hour of Day (hourly bins, 12:00am-12:59am through 11:00pm-11:59pm); y-axis: % users (0 to 100).]

[Figure: bar chart "Number of Crashes per Hour of Day". x-axis: Hour of Day (same hourly bins); y-axis: # crashes (0 to 250). Bar labels in extraction order: 21, 13, 4, 2, 4, 2, 8, 26, 114, 161, 164, 171, 167, 214, 196, 208, 204, 139, 107, 95, 87, 58, 43, 46.]

Most people work during the typical hours of 9am to 5pm.

Our data set involves users of various affiliations to the department, hence the wider spectrum of work schedules

Page 11: A Case for an Open Source Data Repository

Reboot Frequency

[Figure: bar chart "Percentage of Users Rebooting their Computer at Specified Frequency". x-axis: interval (days): 1, 2.5, 5, 7, 10, 14, 30, 60, 365; y-axis: Percentage of users (0 to 30).]

Page 12: A Case for an Open Source Data Repository

Automatic Clustering Experiment for Categorizing Apps

Augment the crash data with information about usage patterns and program dependencies
Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related
We determined that we did not have enough data to derive a method to categorize applications in our data set
  Need several instances of every (application, component, error code) combo
As a last resort, we chose to categorize apps based on application functionality
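For readers unfamiliar with the clustering step, here is a minimal k-means sketch in plain Python. It is only an illustration under stated assumptions: the application names and the two features per application (crashes per week, hours of use per week) are invented, and the agglomerative variant mentioned on the slide is not shown.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples.

    Centroids start at the first k points so the run is deterministic;
    real use would normally randomize the start and rescale features.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical per-application features: (crashes per week, hours of use per week).
apps = {
    "browser_a": (12.0, 20.0),
    "editor_a": (1.0, 8.0),
    "browser_b": (10.0, 18.0),
    "editor_b": (2.0, 9.0),
}
labels, _ = kmeans(list(apps.values()), k=2)
groups = dict(zip(apps, labels))
print(groups)
```

With only a handful of observations per application the clusters are unstable, which matches the slide's remark that the data set was too small for this approach.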

Page 13: A Case for an Open Source Data Repository

Crash Cause by Application Category

Application Category    # Crashes   Crash %   Usage %
web browsing            598         41%       18%
unknown                 185         13%       n/a
document preparation    152         11%       22%
email                   130         9%        24%
scientific computing    95          7%        7%
document viewer         84          6%        8%
multimedia              57          4%        6%
code development        26          2%        10%
document archiving      23          2%        n/a
remote connection       23          2%        n/a
instant messaging       17          1%        n/a
i/o                     15          1%        n/a
other                   14          1%        1%
database                8           1%        n/a
system management       8           1%        4%
security                7           0%        n/a

Page 14: A Case for an Open Source Data Repository

Application Hang vs Crashes due to Faulty Component

[Figure: pie chart "Crash Cause". application hang: 48%; faulty component: 52%.]

Page 15: A Case for an Open Source Data Repository

Which applications hang?

Application                         # hangs   % hangs   % Running Total
iexplore.exe                        185       25%       25%
matlab.exe                          68        9%        34%
winword.exe                         67        9%        43%
outlook.exe                         60        8%        51%
firefox.exe                         47        6%        57%
netscape.exe                        41        6%        63%
unknown                             25        3%        66%
powerarc.exe                        19        3%        69%
powerpnt.exe                        13        2%        71%
thunderbird.exe                     13        2%        73%
excel.exe                           12        2%        75%
acrobat.exe                         11        1%        76%
explorer.exe                        11        1%        77%
mozilla.exe                         11        1%        78%
acrord32.exe                        10        1%        79%
msimn.exe                           10        1%        80%
AdDestroyer.exe                     7         1%        81%
wmplayer.exe                        7         1%        82%
notepad.exe                         6         1%        83%
rundll32.exe                        5         1%        84%
hp precisionscan pro.exe            4         1%        85%
mathematica.exe                     4         1%        86%
msaccess.exe                        4         1%        87%
msdev.exe                           4         1%        88%
photosle.exe                        4         1%        89%
winamp.exe                          4         1%        90%
apps causing <1% of crashes each    84        11%       101%
Total                               736
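The final running total of 101% is what one gets by rounding each share to a whole percent and then accumulating the rounded values. The sketch below checks that arithmetic with the hang counts from the table; that the column was built this way is an assumption, not something the slides state.

```python
# Hang counts copied from the table above, in table order; the last entry
# is the combined "<1% each" row.
hang_counts = [185, 68, 67, 60, 47, 41, 25, 19, 13, 13, 12, 11, 11, 11,
               10, 10, 7, 7, 6, 5, 4, 4, 4, 4, 4, 4, 84]

total = sum(hang_counts)

# Round each share to a whole percent first, then accumulate the rounded
# values (assumed reconstruction of the "% Running Total" column).
rounded = [round(100 * n / total) for n in hang_counts]
running = []
acc = 0
for pct in rounded:
    acc += pct
    running.append(acc)

print(total, running[-1])  # prints: 736 101
```

Accumulating the unrounded shares and rounding only for display would end at exactly 100%.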

Page 16: A Case for an Open Source Data Repository

Which components cause crashes?

Component         Description                                               Author     Apps invoking component               % crash (count)
ntdll.dll         NT system functions                                       MS         Internet Explorer, Matlab             11% (86)
msvcrt.dll        Microsoft C runtime library                               MS         Acrobat, Netscape                     5% (37)
acrord32.exe      Acrobat Reader                                            3rd party  Acrobat Reader                        4% (29)
pdm.dll           Scripting component functions                             MS         Visual Studio, Internet Explorer      3% (23)
firefox.exe       Web browser                                               3rd party  Firefox                               2% (19)
user32.dll        Communication, message handler, timer functions           MS         Firefox, Internet Explorer            2% (17)
ray_tracing.exe   User application                                          3rd party  --                                    2% (16)
winword.exe       Windows document editor                                   MS         Word, Outlook                         2% (15)
mshtml.dll        HTML related functions                                    MS         Internet Explorer, Netscape           2% (15)
tempest.exe       Unknown                                                   3rd party  --                                    2% (15)
gklayout.dll      Mozilla layout library                                    3rd party  Thunderbird, Firefox                  2% (14)
kernel32.dll      Microsoft memory management, I/O and interrupts library   MS         Acrobat, Firefox, Internet Explorer   2% (14)
simpl_fox_gl.exe  User application                                          3rd party  --                                    2% (14)
rpcl3260.dll      Real Player component                                     3rd party  Real Player                           2% (13)
thunderbird.exe   Mozilla e-mail program                                    3rd party  Thunderbird                           2% (13)

Page 17: A Case for an Open Source Data Repository

BOINC
http://winerror.cs.berkeley.edu/crashcollection/

Berkeley Open Infrastructure for Network Computing
Users download the BOINC client app
Crash dumps are scraped/sent to BOINC servers
Currently 791 accounts created for crash collection + resource management
  492 users for crash collection

Page 18: A Case for an Open Source Data Repository

OS Crashes

Driver faults
  asynchronous events
  code must follow kernel programming etiquette
  exceedingly difficult to debug
Memory corruption
  Hardware problems (e.g. non-ECC mem)
  Software-related
  47 of these in our dataset so far… don’t have tools to analyze these in detail

Page 19: A Case for an Open Source Data Repository

OS crash causing images (based on 150 BOINC users, 562 crashes)

Image Name     Image Description                                      Num crashes  % crashes  % Running Total
ntoskrnl.exe   NT kernel and system                                   150          27%        27%
GDFSHK.SYS     McAfee Privacy Service File Guardian                   42           8%         35%
ALCXWDM.SYS    Windows (R) WDM driver for Realtek AC'97               40           7%         42%
kmixer.sys     kernel audio mixer of Microsoft Windows                28           5%         47%
win32k.sys     multi user win32 driver                                19           3%         50%
ati3d2ag.dll   ATI Technologies Inc. Radeon DirectX Universal Driver  18           3%         53%
Brwgate.sys    NAT/Proxy/Firewall system                              16           3%         56%
HSF_CNXT.sys   Conexant Systems SoftK or SoftK56 Modem Driver         10           2%         58%
Ialmdev5.DLL   Intel graphics driver                                  10           2%         60%
ati2dvag.dll   ATI Radeon WinNT display driver                        8            1%         61%
nv4_disp.dll   NVIDIA Compatible Windows 2000 display driver          8            1%         62%
V7.SYS         IBM V7 Driver for Windows NT/2000                      8            1%         63%
usbscan.sys    Microsoft usb driver                                   7            1%         64%
ALCXSENS.SYS   Windows (R) WDM driver for Realtek AC'97               6            1%         65%
ar5211.sys     driver for dual band WIFI wireless mini pci adapter    6            1%         66%
pcx500.sys     NDIS5.1 Miniport Driver for 32 bit Windows             6            1%         67%
Unknown_Image  --                                                     6            1%         68%
ati3duag.dll   ATI Technologies Inc. Radeon DirectX Universal Driver  5            1%         69%
AVGNTDD.SYS    Filter Device for Windows XP/2000/NT                   5            1%         70%
nv4_mini.sys   NVIDIA Compatible Windows 2000 Miniport Driver         5            1%         71%

Page 20: A Case for an Open Source Data Repository

Crash generating driver fault type

Driver Fault Type                                       Num Crashes
PAGE FAULT IN NONPAGED AREA                             118
IRQL NOT LESS OR EQUAL                                  105
KERNEL MODE EXCEPTION NOT HANDLED                       67
UNEXPECTED KERNEL MODE TRAP                             63
BAD POOL CALLER                                         46
THREAD STUCK IN DEVICE DRIVER                           36
SYSTEM THREAD EXCEPTION NOT HANDLED                     29
Unknown bugcheck code                                   16
Other (each caused 1 crash)                             14
PFN LIST CORRUPT                                        13
DRIVER CORRUPTED EXPOOL                                 12
DRIVER UNLOADED WITHOUT CANCELLING PENDING OPERATIONS   8
MANUALLY INITIATED CRASH                                5
File Corruption - Unreadable File                       4
BAD POOL HEADER                                         4
KERNEL DATA INPAGE ERROR                                4
NTFS FILE SYSTEM                                        4
CRITICAL OBJECT TERMINATION                             3
FAT FILE SYSTEM                                         3
DRIVER POWER STATE FAILURE                              2
KERNEL STACK INPAGE ERROR                               2
MEMORY MANAGEMENT                                       2
MULTIPLE IRP COMPLETE REQUESTS                          2
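A tally like the one above can be produced by mapping the bugcheck (stop) code recorded in each kernel dump to its symbolic name. The sketch below shows the idea with a handful of standard Windows codes; the input list is made up, and how the study actually extracted the codes is not described in the slides.

```python
from collections import Counter

# A few standard Windows bugcheck (stop) codes and their symbolic names.
BUGCHECK_NAMES = {
    0x0A: "IRQL NOT LESS OR EQUAL",
    0x50: "PAGE FAULT IN NONPAGED AREA",
    0x7E: "SYSTEM THREAD EXCEPTION NOT HANDLED",
    0x7F: "UNEXPECTED KERNEL MODE TRAP",
    0x8E: "KERNEL MODE EXCEPTION NOT HANDLED",
    0xC2: "BAD POOL CALLER",
    0xEA: "THREAD STUCK IN DEVICE DRIVER",
}

def tally_fault_types(codes):
    """Count crashes per fault type; codes missing from the lookup are
    grouped under 'Unknown bugcheck code'."""
    return Counter(BUGCHECK_NAMES.get(c, "Unknown bugcheck code") for c in codes)

# Hypothetical input: one bugcheck code per parsed crash dump.
# 0x12345 is a made-up value standing in for an unrecognized code.
observed = [0x50, 0x0A, 0x50, 0xC2, 0x12345]
tally = tally_fault_types(observed)
for name, count in tally.most_common():
    print(f"{name}: {count}")
```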

Page 21: A Case for an Open Source Data Repository

Summary of crash analysis

Application crashes are caused by both faulty, non-robust dll files and impatient users
OS crashes are predominantly caused by poorly written device driver code
Commonly used core components are blamed for most crashes
  Need to improve the reliability of these components

Page 22: A Case for an Open Source Data Repository

Practical techniques to reduce crashes

Software-Based Fault Isolation
Nooks
Separate protection level for drivers
Move driver code to user libraries
Virtual Machine for each unsafe/distrusted app

Page 23: A Case for an Open Source Data Repository

Lessons from crash data study

Clearly people want to know what’s wrong and how to fix it

The more feedback we give, the more data sets we receive

...but it’s not as easy as it sounds

Page 24: A Case for an Open Source Data Repository

What kinds of data should we collect?

Failure data
Configuration information
Logs of normal behavior
Usage data
Performance logs
Annotations of data
Collect data for Individual Machines + Services

Page 25: A Case for an Open Source Data Repository

Why are people afraid of sharing data?

Fear of public humiliation (reverse engineering what user was doing)
Revealing problems within their organization
Fear of competitors using data against them
Revealing loopholes through which malware can easily propagate
Revealing dependability problems in third party products (MS)

Page 26: A Case for an Open Source Data Repository

Non-technical challenges to getting data

Collecting (useful) data is tedious
  What information is “necessary and sufficient” to understand data trends?
Privacy concerns
  Especially with usage data
Finding the person with access to data
  No central location that can be queried for data
Legal agreements take a long time to draft
  Researchers are more willing to share data than lawyers
Publicity

Page 27: A Case for an Open Source Data Repository

Technical solution

Amortize the cost of data collection by building an open source repository

Provide a set of tools to cleanse and mine the data

Page 28: A Case for an Open Source Data Repository

What tools should we implement?

Collect
  BOINC
  Instrumentation (MS, Pinpoint)
  Pre-aggregated data from companies
Anonymize/Preprocess
  Pre-written anonymization tools
  Company-specific privacy requirements
    Hash values of certain fields
    Drop irrelevant fields
    Mask part of data
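The three anonymization operations listed above (hash, drop, mask) might look like the following sketch. Everything specific here is an assumption for illustration: the record fields, the secret key, and the choice of a keyed SHA-256 hash and IPv4 octet masking. It is not a description of an existing tool.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; never shipped with the data.
SECRET_KEY = b"replace-with-a-private-key"

def pseudonym(value, key=SECRET_KEY):
    """Keyed hash: the same input always maps to the same token, but the
    token cannot be recomputed from the name without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def anonymize(record, hash_fields, drop_fields, mask_fields):
    """Apply the three operations named on the slide to one record (a dict)."""
    out = {}
    for field, value in record.items():
        if field in drop_fields:
            continue                          # drop irrelevant fields
        if field in hash_fields:
            out[field] = pseudonym(value)     # hash values of certain fields
        elif field in mask_fields:
            # mask part of the data: keep the first two octets of an IPv4 address
            octets = value.split(".")
            out[field] = ".".join(octets[:2] + ["x", "x"])
        else:
            out[field] = value
    return out

# Hypothetical crash record; the field names are illustrative only.
record = {
    "user": "alice",
    "machine": "lab-pc-07",
    "ip": "192.0.2.44",
    "window_title": "draft.doc",
    "component": "ntdll.dll",
}
clean = anonymize(
    record,
    hash_fields={"user", "machine"},
    drop_fields={"window_title"},
    mask_fields={"ip"},
)
print(clean)
```

Because the hash is deterministic for a given key, two crashes from the same machine still share an identifier after anonymization, so they can be correlated without exposing the real name.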

Page 29: A Case for an Open Source Data Repository

Tools cont’d

Store
  Open-source repository schema
  Common log format / data descriptor headers
  Tools to convert log metadata to common format to cross-link data tables
  Sample queries: data mining ~ asking questions about data as it is
Analyze/Experiment
  SLT algorithms
  Visualization
  Stream processing
  Other tools (e.g. WinDbg)
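To make the storage idea concrete, here is a small sketch of a repository schema with a descriptor table, a cross-linked event table, and one sample query, using SQLite from the Python standard library. The table names, columns, composite key, and sample rows are all assumptions for illustration; the slides do not define a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Descriptor table: one row per contributed data source.
    CREATE TABLE source (
        source_id   INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        log_format  TEXT NOT NULL
    );
    -- Common event table, cross-linked to its source.
    CREATE TABLE crash_event (
        source_id   INTEGER NOT NULL REFERENCES source(source_id),
        occurred_at TEXT NOT NULL,      -- ISO 8601 timestamp
        machine     TEXT NOT NULL,      -- hashed identifier
        component   TEXT,
        PRIMARY KEY (source_id, occurred_at, machine)
    );
""")

conn.execute("INSERT INTO source VALUES (1, 'example-site', 'minidump')")
conn.executemany(
    "INSERT INTO crash_event VALUES (?, ?, ?, ?)",
    [
        (1, "2005-01-03T09:15:00", "a1f3", "ntdll.dll"),
        (1, "2005-01-03T10:40:00", "a1f3", "ntdll.dll"),
        (1, "2005-01-04T14:05:00", "9c2e", "msvcrt.dll"),
    ],
)

# Sample query: which components are blamed most often?
rows = conn.execute(
    """
    SELECT component, COUNT(*) AS crashes
    FROM crash_event
    GROUP BY component
    ORDER BY crashes DESC, component
    """
).fetchall()
print(rows)
```

Keeping the per-source descriptor separate from the events lets contributors with different log formats share one event table once their logs are converted.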

Page 30: A Case for an Open Source Data Repository

Thoughts on Collection/Anonymization

Defining necessary and sufficient
  Bad example: cannot correlate crashes if we get rid of all user/machine names
  Good example: hash user/machine names
Default: hide if not necessary?
What would it take for you not to invoke the legal dept?

Page 31: A Case for an Open Source Data Repository

Thoughts on Storage/Analysis

Use time/data source as primary key?

How domain-specific should the common format be?

Management logistics…
Access control…

Page 32: A Case for an Open Source Data Repository

Acronym Suggestions???

Open Source (Failure) Data Repository