What We Can Learn from Big Bugs that Got Away Ken Johnston, Group Manager Office, Internet Platforms & Operation EuroSTAR 2010

Ken Johnston - Big Bugs That Got Away - EuroSTAR 2010


DESCRIPTION

EuroSTAR Software Testing Conference 2010 presentation on Big Bugs That Got Away by Ken Johnston . See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/


Page 1: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

What We Can Learn from Big

Bugs that Got Away

Ken Johnston, Group Manager
Office, Internet Platforms & Operation

EuroSTAR 2010

Page 2: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

I Want to know more about YOU

• Who wandered in here by accident

• Who is at EuroSTAR for the first time

• How long have you been in Software Testing

• Have you ever missed a bug

• Have you ever heard…

Page 3: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

“HOW COULD

YOU MISS

THAT BUG!!!”

Page 4: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Def. – Rolling around in something disgusting

Page 5: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Ken’s Big

Bug Story

Page 6: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

It all began one dark and stormy night!

Page 7: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010
Page 8: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Session Overview

• About you, me and setting the tone
• Bug Wallowing 1 – A self-reflective journey
• Bug Wallowing 2 – Group Therapy
• Root Cause Analysis 101
▫ Sentinel Events
▫ Pattern Analysis
▫ Formal RCA program overview
• Bug Wallowing 3
• Five Whys
• Bug Wallowing 4
• Fishbone
• Bug Wallowing 5
• Crafting a good bug story

Page 9: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Learning Objectives

1. Be armed to deal with the question, “How did test miss this bug?”

2. Learn a little about formal RCA and the use of the 5 Whys and Fishbone tools

3. Have a number of highly instructive bug stories from within your organization that you can take home

Page 10: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Def. – Roll in something: to lie down and roll around in something

Page 11: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

“HOW COULD

YOU MISS

THAT BUG!!!”

Page 12: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time for some “Group Bug” Therapy

Page 13: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Repeat After Me

• I did not design the bug.

• I did not code the bug.

• I found crashing bugs, data corruption bugs, fit and finish bugs.

• I found hundreds of bugs.

Page 14: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Repeat After Me

• So what if I missed a bug?

• I didn’t write the bug in the first place.

Page 15: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Activity: Share Your Bug Story

• Take the next 10 minutes

• Groups of 2 or 3

• Think of a bug that got away

• Minimum One Bug story each

• Questions to ask

▫ How long after ship did you see this

▫ How big was the impact

▫ How did it get missed

▫ What did you change because of this bug

Page 16: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

That’s Time

Page 17: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time to Share

• Next 5 minutes or so

• Did you have any Ah Ha moments?

Page 18: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Why do we Wallow in Bugs that got away?

• Take 3-5 minutes to discuss in your groups

Page 20: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time to Share

• What did you come up with?
• Why do we wallow?
• Why do we RCA bugs?
• My List
▫ To learn from mistakes
▫ To systematically identify areas for improvement
▫ To prevent repetition of mistakes
▫ Bugs are stories and organizations are driven by the stories they tell

Page 21: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

First we need a common baseline to work from

Page 22: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Root Cause Analysis 300 Level

• Two approaches to RCA
▫ Sentinel Event
▫ Pattern Analysis
• Formal RCA Program
▫ Data Collection
▫ Data Analysis and Assessment
▫ Corrective Actions
• The Pit and the Pendulum
▫ Risks of RCA
▫ Benefits of RCA

Based upon Ch. 11
PDF available to EuroSTAR attendees
http://defectprevention.org

Page 23: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA –Sentinel Event Bugs

• How do you know it’s a Sentinel Event Bug?

• If you make the front page of the http://wsj.com

• Production Outage
▫ I have a lot of these stories
• Security vulnerabilities
• The last bug taken before ship
▫ “How could we have missed this!”

• Any big bug that got away

• Nothing to do with the X-Men

Page 24: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA Pattern Analysis

• Pattern Analysis requires a lot of bugs

• Pattern Analysis can be done over time

• Pattern Analysis is best served within a formal RCA Program.

▫ Cut some of the slides from this presentation

▫ The full set of slides can be found in the appendix on the EuroSTAR conference website

Page 25: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phases of an RCA Program

1. Event Identification

2. Data Collection

3. Data Analysis and Assessment

4. Corrective Action

5. Inform and Apply

6. Follow-up, measurement and reporting

[Diagram: the six RCA program phases listed above, shown as a sequence of steps]

Page 26: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase 2: Data Collection Exercise

• Data Channels: 5 Minute Discussion in Groups
▫ What are the sources of data in my organization
▫ Which are practical
▫ Which are the most costly to implement
▫ Which are most likely to yield results
▫ Do you have time to implement these


Page 27: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

That’s time

Page 28: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase 2: Data Collection Time to Share

• What sources did you come up with?

Page 29: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase 2: Data Collection (Sources of Data)

• Defect and Test Case Management tracking system
• Source code repository and Test code coverage data
• Voice of the Customer
▫ Product support and Customer or marketing data
▫ Individual surveys and interviews
• Findings from previous RCA Studies
• Crash data through Windows Error Reporting
• Services have tickets and data center telemetry
▫ Heuristic Data of live site now vs. historic

More about WER @ https://winqual.microsoft.com/

Page 30: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase 2: Data Collection (Tracking System)

• Prepare a list of Sentinel Events
• Gather and Prepare the Preliminary Data
• Route Single Event through Process
• Create an RCA Tracking Database

Data Elements of RCA Tracking System

• Event or Study ID, Title & Dates

• Related Defect links

• Failure areas and Source Code

• Timeline of events before and after (vital for services)

• Team Contacts and Owners

• RCA Analysts and Contacts

• Expert Groups and Contacts

• Cause of defect and corrective action

• Survey Data and Results on effectiveness of corrective action

• Log Events in RCA system
• Analyze events

• NOTE: Meta Data better suited for lists, documents and shares
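The data elements listed on this slide can be pictured as one record per event. What follows is only a minimal sketch, not the schema of any real tracking system: the class name `RcaEvent`, every field name, and the sample IDs are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RcaEvent:
    """One row in a hypothetical RCA tracking database (all names are illustrative)."""
    event_id: str                                               # Event or Study ID
    title: str
    opened: date
    related_defects: List[str] = field(default_factory=list)   # links to defects
    failure_areas: List[str] = field(default_factory=list)     # components or source paths
    timeline: List[str] = field(default_factory=list)          # events before and after
    owners: List[str] = field(default_factory=list)            # team contacts and owners
    analysts: List[str] = field(default_factory=list)          # RCA analysts
    experts: List[str] = field(default_factory=list)           # expert groups
    cause: Optional[str] = None
    corrective_action: Optional[str] = None
    effectiveness_survey: Optional[str] = None

    def is_analyzed(self) -> bool:
        """Treat an event as analyzed once a cause and a corrective action are both logged."""
        return self.cause is not None and self.corrective_action is not None


# Placeholder values only; none of these come from a real bug database.
event = RcaEvent(event_id="RCA-001", title="Example sentinel event", opened=date(2010, 1, 1))
event.related_defects.append("BUG-1234")
print(event.is_analyzed())
```

Keeping the cause and corrective action optional reflects the slide's order of work: events are logged first and analyzed afterwards.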

Page 31: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)

Good article from ASQ – http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html


Page 32: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys)

• Brief History - http://en.wikipedia.org/wiki/5_Whys
▫ Developed by Sakichi Toyoda
▫ First used in Toyota Motor Corporation
▫ Common tool within Kaizen, Lean Manufacturing & Six Sigma
• What is it
▫ Simply put - ask why 5 times to get to the root cause of a problem
• Fun Example from - http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫ why was the website down? The CPU utilization on all our front-end servers went to 100%
▫ why did the CPU usage spike? A new bit of code contained an infinite loop!
▫ why did that code get written? So-and-so made a mistake
▫ why did his mistake get checked in? He didn't write a unit test for the feature
▫ why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
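The website-down example on this slide can be written down as a simple chain of question and answer pairs. This is a small sketch for note-taking during an exercise, not a standard tool; the names `Why`, `candidate_root_cause` and `render` are invented here.

```python
from typing import List, NamedTuple


class Why(NamedTuple):
    question: str
    answer: str


def candidate_root_cause(chain: List[Why]) -> str:
    """Return the answer to the last why, which the technique treats as the candidate root cause."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


def render(chain: List[Why]) -> str:
    """Number each step so the path from symptom to cause is easy to read back."""
    return "\n".join(
        f"{depth}. {why.question} {why.answer}"
        for depth, why in enumerate(chain, start=1)
    )


# The five steps from the slide's example.
chain = [
    Why("Why was the website down?",
        "The CPU utilization on all our front-end servers went to 100%"),
    Why("Why did the CPU usage spike?",
        "A new bit of code contained an infinite loop"),
    Why("Why did that code get written?", "So-and-so made a mistake"),
    Why("Why did his mistake get checked in?",
        "He didn't write a unit test for the feature"),
    Why("Why didn't he write a unit test?",
        "He's a new employee, and he was not properly trained in TDD"),
]

print(render(chain))
print("Candidate root cause:", candidate_root_cause(chain))
```

The function is deliberately called a candidate root cause: whether the chain stopped at a symptom is a judgment the investigator still has to make.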


Page 33: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Def. – indulge in something excessively: to take pleasure or be immersed in something in a self-indulgent way

Page 34: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Insert Bug Story Videos

Page 35: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Five Whys Exercise

• Take 5-10 minutes
• Use one of these bugs or one of your own
• Try the five whys and see if you can find a root cause

Page 36: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

That’s time

One does not worry

about grace or dignity

Page 37: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time to Share

• Time for about 2 examples

• What about the 5 Whys worked for you

• Where did it fall short?

Page 38: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys)

• Criticism of five whys
▫ Not reproducible across individuals
▫ Shown that investigators tend to stop at symptoms rather than root causes
▫ Relies upon the investigator's knowledge

Page 39: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (Fishbone Diagram)

• Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram
▫ Developed by Kaoru Ishikawa in the 1960s
▫ One of the 7 basic quality management tools
• Can use with 5 whys
▫ Put each why off the first tree point
▫ Ask why for each one of these issues
▫ Keep going until you find one or more root causes
• Some industries have common causes mapped to the fishbone
▫ Original 4 Ms – Machine, Method, Material, Manpower
▫ The 8 Ps (Used in Service Industry) – People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product
▫ Ken’s List – People & Training, Tools, Inspection and supervision, Pressure or Stress, Process & Accountability, Recognition & Awareness
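A fishbone is, structurally, a problem statement with a fixed set of branches and a list of causes under each. The sketch below uses the branch names from Ken's List and the "Brownout across 3 largest datacenters" problem named in the deck; the `Fishbone` class and the cause text are placeholders invented for illustration, not findings from that incident.

```python
from dataclasses import dataclass
from typing import Dict, List

# Branch names taken from "Ken's List" on the slide.
KENS_LIST = [
    "People & Training",
    "Tools",
    "Inspection and supervision",
    "Pressure or Stress",
    "Process & Accountability",
    "Recognition & Awareness",
]


@dataclass
class Fishbone:
    problem: str
    branches: Dict[str, List[str]]

    @classmethod
    def with_branches(cls, problem: str, names: List[str]) -> "Fishbone":
        """Start with every branch present and empty."""
        return cls(problem, {name: [] for name in names})

    def add_cause(self, branch: str, cause: str) -> None:
        """Attach a cause to an existing branch; unknown branches are rejected."""
        if branch not in self.branches:
            raise KeyError(branch)
        self.branches[branch].append(cause)

    def empty_branches(self) -> List[str]:
        """Branches nobody has examined yet, useful as a prompt for the group."""
        return [name for name, causes in self.branches.items() if not causes]


diagram = Fishbone.with_branches("Brownout across 3 largest datacenters", KENS_LIST)
diagram.add_cause("Tools", "(placeholder) cause found by asking why")
print(diagram.empty_branches())
```

Listing the empty branches is one way to apply the "keep going" advice: a branch with no causes may simply not have been asked about yet.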

Page 40: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

[Fishbone diagram for the event “Brownout across 3 largest datacenters”, with branches: People & Training, Inspection & Supervision, Tools, Process & Accountability, Recognition & Awareness, Pressure or Stress]

Page 41: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

• Deployment tool changes

▫ Warn but do not prevent multi-DC deployments

▫ Automatically generate rollback script

▫ Cross service monitors will cancel and roll back a bad deployment automatically

• Process changes

▫ Deployment code review

▫ Deployment checklist

▫ Audits and Fire drills

Audited all alerts, escalation aliases and contact #s

Fire drill email and phone

• New Tools

▫ Per-Alert fault injection

• Recognition

▫ SWAT DRI team for most senior DRIs

Page 42: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Fishbone Exercise

• Take 5-10 minutes
• Have a handout for you
• Use the same bug from the five whys exercise

Page 43: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

That’s time

Page 44: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time to Share

• Time to share
▫ Who did the same bug as the five whys?

▫ Who did a different bug?

• What about the fishbone worked for you?

• Where did it fall short?

Page 45: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Fishbone)

• Criticism of Fishbone
▫ Requires a lot of experts for each branch
▫ Cumbersome

Page 46: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase V: Inform and Apply

• Host a Management Review

▫ Managers will like RCA more than bugs

▫ You are eliminating a problem not just finding it

• Implementation is a project, treat it that way
▫ Assign Owners

▫ Build and Maintain Schedule

▫ Create a Feedback Loop

▫ Establish a Monthly Status Report

▫ Track and correct the corrective action


Page 47: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase VI: Follow-up, Measurement, and Reporting

• More than Just
• Six Sigma type approaches
• Longitudinal Analysis
▫ Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫ Study Over Time
• Develop failure types and risk areas/components
• Inspect similar products/areas for baseline
• Gather and inspect process data
• Examine Data for Trends
• Report out
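As a rough illustration of "Examine Data for Trends", events can be counted by failure type per reporting period and read across periods. This is a sketch under assumptions: the period labels, the sample rows, and the function names `counts_by_period` and `trend` are all invented, and the sample is placeholder data, not results from any product.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def counts_by_period(events: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count events per failure type within each period; events are (period, failure_type)."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for period, failure_type in events:
        table[period][failure_type] += 1
    return dict(table)


def trend(table: Dict[str, Counter], failure_type: str) -> List[Tuple[str, int]]:
    """Return (period, count) for one failure type, ordered by period label."""
    return [(period, table[period][failure_type]) for period in sorted(table)]


# Placeholder data for illustration only.
sample = [
    ("2010-Q1", "Logic Error"),
    ("2010-Q1", "Logic Error"),
    ("2010-Q1", "Build Error"),
    ("2010-Q2", "Logic Error"),
    ("2010-Q2", "Build Error"),
    ("2010-Q2", "Build Error"),
]

table = counts_by_period(sample)
print(trend(table, "Logic Error"))
print(trend(table, "Build Error"))
```

A failure type that is absent in a period simply counts as zero, so the series for every type has one entry per period and can be compared side by side.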


Page 48: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Def. – have huge amount of something: to have an ample or excessive supply of something

Page 49: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA Pit and Pendulum

Page 50: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Risks of Root Cause Analysis

• Begins with inadequate data

• Go after too much data too early

• Draws incorrect conclusions or makes invalid recommendations
▫ Anyone experience this before?

• Focus on the wrong set of defects

• Ends at the wrong level – too early or late

• Investment is not always predictable
▫ Can be high cost with low ROI

• Over focus on data can detract from the story

Page 51: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Benefits of Structured RCA Study

• Can start as small pilots

• Uses an identical process regardless of type, age or scope of defect

• Avoids repeat failures

• Can be the shortest path to determining and correcting causes of failure

• Lowers Maintenance Costs

• Builds a culture of
▫ Accountability

▫ Continuous Improvement

Page 52: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Achieve Balance

• Full-blown RCA with large pattern analysis rarely meets ROI goals.
• Limit the scope
▫ Few Data Sources
▫ Beware of the RCA Tax
• Focus on Sentinel Events
▫ Provides opportunity for clear, visible wins

▫ If it’s a bug that got away you’ll be doing a Post Mortem anyway

▫ Sentinel events provide an opportunity to change the dialogue

Page 53: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

I’ve had enough

Page 54: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Telling a Tall Tale

Page 55: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

So why a focus on Bugs that got away

• Bugs that got away are Sentinel Events

• They are great stories
▫ There is never an end to bugs

• Bug Stories are Organizational Knowledge

• Tribal Knowledge drives organizations

• Stories are powerful change enablers

Page 56: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Stories Work!

Biographies

Allegories

Page 57: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Gloves on the Boardroom Table

• The Heart of Change
▫ Requires an emotional component
▫ What is more emotional than “How could test miss this bug!”
• Not all change stories involve yelling
• Visual and tactile help too
▫ Handout of “Gloves on the boardroom table”
▫ [email protected] “I love your idea. And you have my permission.”

Page 58: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Organizational Development

• I worked in Engineering Excellence
▫ We were a Performance Improvement organization

▫ Enterprise Change Management

• Let me bring in some OD concepts

Page 59: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Knowledge Management (KM)

comprises a range of practices used in an organization to identify, create, represent, distribute and enable adoption of insights and experiences.

Such insights and experiences comprise knowledge, either embodied in

individuals or embedded in

organizational processes or practice.

http://en.wikipedia.org/wiki/Knowledge_management

Page 60: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

What are Organizations Made of?

PEOPLE

Page 61: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

What do people do?

Talk about stuff

Page 62: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Tribal Knowledge

Institutional memory is a collective set of

facts, concepts, experiences and know-how held

by a group of people.

http://en.wikipedia.org/wiki/Institutional_memory

Page 63: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Organizational Storytelling

The study of organizational storytelling, sometimes called “Narrative Knowledge,” attempts to recount events in the form of a story within the context of an organization

http://en.wikipedia.org/wiki/Organizational_Storytelling

Page 64: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

So, what is a bug story?

One that should…

be part of the Organizational Narrative Knowledge

Page 65: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Springboard Story

• Very simple, very quick, very brief

▫ Think elevator ride

• Non-threatening

• Enables listener to visualize

• Catalyzes understanding

• Sparks new stories in the mind

• Does not transfer large amounts of information

Page 66: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Story Telling Tips

• Brains are not computers
▫ Brain Movies – “The brain assembles perceptions by the simultaneous interaction of whole concepts, whole images.”
• The Central Movie – a country or organization
▫ Universal Principles – freedom, democracy, constitutional government
▫ Long-term goals – education, “life, liberty, pursuit of happiness”
▫ Operating methods – free markets, due process, federal and state governments
• Capture the Audience
▫ “One time there was this bug we missed…”
• 3D Story Telling pg 85-87
▫ Details (facts, information)
▫ Dialogue (characters)
▫ Drama (a bug that got away?)

Brain Movies, The Central Movie, and 3D Story Telling from “The Leader’s Voice”

Page 67: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Our Last Exercise!

• Your own bug story in 10 minutes
▫ Take 10 minutes outlining your story
▫ Goal is a 1-2 minute story. Think short and tight
• Remember to
▫ Hook the audience
▫ 3D Storytelling – Details, Dialogue, Drama
▫ RCA – what change do you want to convey?

Page 68: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

My Bug Story - Template

• Title

• The Hook

• Details – Who, what, when, product/project

• Dialogue – Yelling, Crying, Funny?

• Drama – What is the tension? Anyone Fired?

• What were the Root Causes

• What did you change and why?

Page 69: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

That’s lunch time

Page 70: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Time to Share

• 3 volunteers to come up and tell their bug story

Page 71: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Resources

• “The Leader’s Guide to Storytelling” by Steve Denning
▫ Resources – http://www.stevedenning.com/launchgifts.html
▫ Audio Interview - The knowledge-based organization: Using stories to embody and transfer knowledge http://www.storytellingwithchildren.com/2008/01/12/steve-denning-the-knowledge-based-organization/
• “The Leader’s Voice” by Crossland & Clarke
▫ http://roncrossland.com/
• Defect Prevention Chapter 11 RCA
▫ http://defectprevention.org
• “The Heart of Change” by Dr. John P. Kotter
▫ Gloves story can be found on pages 11-12 http://www.linkageinc.com/pdfs/disl/KotterPG.pdf

Page 72: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

http://www.hwtsam.com

http://blogs.msdn.com/kenj

http://twitter.com/rkjohnston

Chapter 14 (Software + Services Testing) from “How We Test Software at Microsoft” provided on conference CD courtesy of Microsoft Press

Ken Johnston – Microsoft STARWest 2009 Tutorial TJ

What We Can Learn from Big Bugs that Got Away

Page 73: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Appendix

• What follows are a series of slides to teach RCA.

• Some of the slides are integrated in this tutorial on Bugs that Got Away but not all.

Page 74: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

First we need a common baseline to work from

Page 75: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Root Cause Analysis 300 Level

• Two approaches to RCA
▫ Sentinel Event
▫ Pattern Analysis
• Formal RCA Program
▫ When to do an RCA Study
▫ Staffing for Success
▫ Phases of an RCA Study
• The Pit and the Pendulum
▫ Risks of RCA
▫ Benefits of RCA

Based upon Ch. 11
http://defectprevention.org

Page 76: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA Sentinel Event

A sentinel event is defined by the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) as any unanticipated event in a healthcare setting resulting in death or serious physical or psychological injury to a person or persons,

http://en.wikipedia.org/wiki/Sentinel_event

Page 77: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA – The Sentinel Event of Bugs

• Home Page of http://wsj.com

• Production Outage
▫ I have a lot of these stories

• Security vulnerabilities

• The last bug taken before ship

• “How could we have missed this!”

• Big Bugs that Got Away

Page 78: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA – Office 14 Sentinel Bug Process

• Why SharePoint as the repository
▫ Attachments
▫ Collaborating
▫ Workflow
▫ Reporting Dash
▫ Wiki
▫ Exchange contacts
▫ Offline
• Simple Light Weight Approach
• Focus on recall class bugs from O14 Beta 1
▫ Will need the answers anyway to get through triage
▫ Usually logged in the bug but not easy to find or learn from
▫ No consistent process across teams
• Develop a common template in Word
• Track on a SharePoint site with some meta data

Page 79: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Office 14 Root Cause “Template”

• Tenets/Best Practices
• History/Summary
• Bugs
▫ Bug number(s)
▫ Bug description
• Root Cause Questions
▫ Would this get found in our Test Focus/Pass for this area?
▫ When did it get broken?
▫ Was ownership confused?
▫ Would we have assumed that another team would have also seen it?
▫ Would it have been reasonable to assume that the fix that caused the regression would have broken this?
▫ Would a code review have likely identified the issue?
▫ Was there a partner team(s) involved?
▫ Were there multiple PRs involved?
▫ Was the feature "Hot" coming into the close of the milestone?
• Engineering Recommendations:
▫ Recommendation(s)/Owners
▫ 1.
▫ 2.
▫ 3.

Page 80: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

O14 Example Beta1 End Game

• Word: Japanese indented bullets lose their indents when saved
▫ Repro:

Set Japanese to be your primary editing language

Create a bulleted list with indents

Save/Close/Re-open

Result: indents are gone

Expect: no loss of indents

▫ Happens with all docs created with that setting in 12 and 14

Page 81: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

O14 Example RCA Recommendations

• Engineering Recommendation:
▫ Automate this case and use the code change to inform other automation needed for this area (lists, styles, paragraph props)
▫ Ensure that ICTs dogfood the product
▫ Make new push for testers to use international settings more frequently, with an eye on Beta2 languages and risks associated with each language equivalence class – we’ll most likely drive a Mini-pass on all our features with this setting for Beta2

▫ Add this area to testing executed during regression checks on all style-related fixes.

Page 82: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA Sentinel Bug Approach

• Big Bugs that got away are Sentinel Events

• One bug is indicative of other risk

• The more big bugs the more patterns

• Nothing to do with X-Men

Page 83: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Formal RCA Program(Sentinel Events and Pattern Analysis)

• Started at any time during SDLC

• Often launched after a single expensive bug
▫ Security vulnerabilities
▫ Production Outage – I have a lot of these stories

• Can be Resource Intensive - so be deliberate

Page 84: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Staffing for Success – RCA Study Analyst

• A Single Analyst or a Team
▫ Could be you after today
• Senior with wide range of development process knowledge
• Component Level and System Level analysis
• Work with all types - Development, Testing, Program Management, Operations, Support
▫ May include marketing and field personnel
• Skills
▫ Defect and low-level code analysis
▫ Efficiency Diagnosis
▫ RCA Analysis and event understanding
▫ Algorithm and metric development
▫ Data analysis and presentation

Page 85: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phases of an RCA Program

1. Event Identification

2. Data Collection

3. Data Analysis and Assessment

4. Corrective Action

5. Inform and Apply

6. Follow-up, measurement and reporting


Page 86: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase I: Event Identification

• The Sentinel Event
▫ Bug that got away and customer found
▫ Does not need to be a defect
▫ One or multiple
• Often too many bugs to pick from
▫ For an RCA program first establish criteria for a sentinel event


Page 87: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase I: Event Identification (Sentinel Event Criteria)

• Not all bugs will yield a true “root” cause
• Focus on most severe/undesirable event
▫ “I remember this one bug…”
• Risk based assessment criteria
▫ Severity
▫ Risk of recurrence
▫ Cost – actual and opportunity

Sentinel Event Data Channel Loop (flow diagram), with steps:
• Identify Sentinel Event Criteria
• Identify Data Channels
• Route Single Event through Process
• Prepare Data & Map Fields (defect tracking system query)
• Log Event in RCA Tracking Database
• Event to Analyze
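The three risk-based criteria on this slide (severity, risk of recurrence, cost) can be turned into a simple screening rule. The slide gives no formula, so everything numeric below is an assumption: the 1 to 5 scales, multiplying the three ratings, and the threshold of 27 are illustrative choices, as are the names `Candidate` and `select_sentinel_events`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    title: str
    severity: int          # assumed scale: 1 (minor) to 5 (critical)
    recurrence_risk: int   # assumed scale: 1 (unlikely) to 5 (almost certain)
    cost: int              # assumed scale: 1 to 5, actual plus opportunity cost

    def score(self) -> int:
        """Multiply the ratings so one very low rating pulls the total down."""
        return self.severity * self.recurrence_risk * self.cost


def select_sentinel_events(candidates: List[Candidate], threshold: int = 27) -> List[Candidate]:
    """Keep candidates at or above the threshold, highest score first."""
    chosen = [c for c in candidates if c.score() >= threshold]
    return sorted(chosen, key=lambda c: c.score(), reverse=True)


# Placeholder candidates for illustration only.
candidates = [
    Candidate("Outage in production", severity=5, recurrence_risk=4, cost=3),
    Candidate("Typo in dialog text", severity=2, recurrence_risk=2, cost=2),
    Candidate("Data loss on save", severity=3, recurrence_risk=3, cost=3),
]

for c in select_sentinel_events(candidates):
    print(c.score(), c.title)
```

Whatever the exact scales, agreeing on the criteria before picking bugs is the point of the slide: it keeps the selection from defaulting to whichever bug was yelled about most recently.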

Page 88: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase I: Event Identification (Data Channel – Sources of Data)

• Defect and Test Case Management tracking system
• Source code repository and Test code coverage data
• Voice of the Customer
▫ Product support and Customer or marketing data
▫ Individual surveys and interviews
• Findings from previous RCA Studies
• Crash data through Windows Error Reporting
• Services have tickets and data center telemetry
▫ Client and Cloud testing session tomorrow

More about WER @ https://winqual.microsoft.com/

Page 89: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase I: Event Identification (Tracking System)

• Prepare a list of Sentinel Events
• Gather and Prepare the Preliminary Data
• Route Single Event through Process
• Create an RCA Tracking Database

Data Elements of RCA Tracking System

• Event or Study ID, Title & Dates

• Related Defect links

• Failure areas and Source Code

• Timeline of events before and after (vital for services)

• Team Contacts and Owners

• RCA Analysts and Contacts

• Expert Groups and Contacts

• Cause of defect and corrective action

• Survey Data and Results on effectiveness of corrective action

• Log Events in RCA system
• Analyze events

• NOTE: Meta Data better suited for lists, documents and shares

Page 90: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase II: Data Collection

• Use Common Sense and Trust Gut Feel
▫ “Hey did you hear about the bug…”
▫ “I heard BillG was doing a demo when…”
• Use a survey to gather additional data
▫ Was this noticed and ignored
▫ Is this a common error type
▫ Could this have been prevented

• Gather common data on several sentinel events


Page 91: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase II: Data Collection

• Windows Customized (Visual Studio Team System)
▫ Part of Defect Tracking System
▫ Connect to source code
▫ Attachments
▫ Collaborating
▫ Workflow


• Windows Vista ran a full RCA program

• Windows 7 moved to ezRCA
▫ Cut many of the other data sources
▫ Focus on meta data around bugs

Page 92: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Windows “ezRCA” Approach

Windows ezRCA Program

The Goal
• Reduce Defects Throughout the Product Cycle

The Questions
• What type of defect?
• What phase was the defect introduced?
• What was the extent of the fix?
• How long did it take to fix the defect?

The Source
• Product Studio Extension (Per Bug Report)

Leverage Points
• Distributed Workflow
• Quick and Easy Data Collection
• Aggregate Analysis and Trend Charts
• Subcomponent-Level Data Also Available
• Focus on Individual Improvement

Page 93: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Windows EZ RCA Diagnosis

As is / New

• Diagnosis is currently required for all bugs and defaults to NA

• This field should only be activated if the bug is resolved “Fixed” or “Won’t Fix”

• There should be no default value

• Change/combine Hardware & No HW to Hardware Issue

NOTE: Items in RED are new or changed

Diagnosis values:
• Assignment Error
• Build Error
• Concurrency Error
• Data Checking Error
• Data Corruption
• Doc Error
• Environment Error
• Error Handling Problem
• Hardware Issue
• Ignored Failure
• Incorrect Program State
• Interface Error
• Missing Method/Function
• Logic Error
• Not Applicable
• Other
• Resource Issue
• Simple Coding Error
• System Error
• User Misunderstanding

Page 94: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Windows ezRCA Values

• Initial classification of root causes
• Root cause helps us identify the nature of the kinds of mistakes we are making
• This will be a required field for Developers when resolving a bug that is ‘Fixed’ or ‘Won’t Fix’

• This will be a single-select dropdown list and developers will be expected to select the item that is most applicable

• This field is not intended to replace deep RCA studies and more information will likely be required based on analysis of this data

• For gathering further information, use the Prevention Tab, Test Follow-up Tab, and Bug Analysis Tabs in Product Studio or Soapbox (NOTE: Much of this will be consolidated in the future)
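The field rules described on the ezRCA slides (a single-select diagnosis, required only when a bug is resolved as Fixed or Won't Fix, with no default) can be sketched as a small validation step followed by a tally. This is not Product Studio's actual behavior or API; the function names and the "Duplicate" resolution used in the sample are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Diagnosis values as listed on the Windows EZ RCA Diagnosis slide.
DIAGNOSIS_VALUES = (
    "Assignment Error", "Build Error", "Concurrency Error", "Data Checking Error",
    "Data Corruption", "Doc Error", "Environment Error", "Error Handling Problem",
    "Hardware Issue", "Ignored Failure", "Incorrect Program State", "Interface Error",
    "Missing Method/Function", "Logic Error", "Not Applicable", "Other",
    "Resource Issue", "Simple Coding Error", "System Error", "User Misunderstanding",
)
DIAGNOSIS_REQUIRED_FOR = ("Fixed", "Won't Fix")


def check_diagnosis(resolution: str, diagnosis: Optional[str]) -> None:
    """Raise ValueError when the single-select diagnosis breaks the field rules."""
    if resolution in DIAGNOSIS_REQUIRED_FOR:
        if diagnosis is None:
            raise ValueError("a diagnosis must be selected; there is no default")
        if diagnosis not in DIAGNOSIS_VALUES:
            raise ValueError(f"unknown diagnosis: {diagnosis!r}")
    elif diagnosis is not None:
        raise ValueError("the diagnosis field is only active for Fixed or Won't Fix")


def tally(bugs: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Count diagnoses across (resolution, diagnosis) pairs after validating each one."""
    counts: Counter = Counter()
    for resolution, diagnosis in bugs:
        check_diagnosis(resolution, diagnosis)
        if diagnosis is not None:
            counts[diagnosis] += 1
    return counts


# Placeholder bug records; "Duplicate" is an assumed resolution value.
bugs = [
    ("Fixed", "Logic Error"),
    ("Fixed", "Logic Error"),
    ("Won't Fix", "User Misunderstanding"),
    ("Duplicate", None),
]
print(tally(bugs).most_common())
```

The tally is the cheap kind of aggregate analysis ezRCA aims for: one required answer per resolved bug, summed across many bugs, with deeper RCA reserved for what the counts surface.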

Page 95: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Windows Additional RCA data

• Symptom and Prevention categorization

• Link to more info

• Anonymous submission

Page 96: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

ezRCA Pivot Points

ezRCA

• Data on Lots of Bugs

• Few Questions & Answers

• Quick, Easy

• Fully Distributed

Traditional RCA

• Data on Select Fixed Bugs

• Detailed Analysis of Defect

• Multiple-Data Sources

• Significant Investment

• Can be Resource-Limited

Page 97: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase II: Data Collection Keys to Success

• For Sentinel Events, an open template is fine
• For ezRCA, extend the bug tracking system with ezData Collection
▫ Keep the system lightweight
▫ Limit required fields
▫ Provide opportunity to expand within the bug

• For Formal RCA, you will need multiple data sources and an extensible schema

• Recommend you start with Sentinel Events and progress to a formal program

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

Page 98: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Keep going with formal RCA

• Some tools you can use with Sentinel Events and ezRCA

• What good tester doesn’t make you wallow in the details?

Page 99: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment

• Analysis Performed by
▫ RCA Team
▫ Research Team
▫ Related experts

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

• Log all outputs in RCA System
• Be judicious with experts’ time

Page 100: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)

Good article from ASQ: http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html

Page 101: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys)

• Brief History - http://en.wikipedia.org/wiki/5_Whys
▫ Developed by Sakichi Toyoda
▫ First used in Toyota (Kaizen), Six Sigma tool
• What is it
▫ Simply put - ask why 5 times to get to the root cause of a problem
• Fun Example from - http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫ Why was the website down? The CPU utilization on all our front-end servers went to 100%
▫ Why did the CPU usage spike? A new bit of code contained an infinite loop!
▫ Why did that code get written? So-and-so made a mistake
▫ Why did his mistake get checked in? He didn't write a unit test for the feature
▫ Why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
• Criticism of five whys
▫ Not reproducible across individuals
▫ Shown that investigators tend to stop at symptoms rather than root causes
▫ Relies upon the investigator's knowledge

Page 102: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (the Five Whys)

• Brief History - http://en.wikipedia.org/wiki/5_Whys
▫ Developed by Sakichi Toyoda
▫ First used in Toyota Motor Corporation
▫ Common tool within Kaizen, Lean Manufacturing & Six Sigma
• What is it
▫ Simply put - ask why 5 times to get to the root cause of a problem
• Fun Example from - http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫ Why was the website down? The CPU utilization on all our front-end servers went to 100%
▫ Why did the CPU usage spike? A new bit of code contained an infinite loop!
▫ Why did that code get written? So-and-so made a mistake
▫ Why did his mistake get checked in? He didn't write a unit test for the feature
▫ Why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
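The website-down example can be written down as a simple chain of question and answer pairs, with the last answer treated as the candidate root cause. This is only a sketch of how a team might record a Five Whys session; the `Why` class and the `root_cause` helper are hypothetical names, not part of any RCA tool mentioned in this talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    question: str
    answer: str


def root_cause(chain: List[Why]) -> str:
    """Treat the answer to the last 'why' as the candidate root cause."""
    if not chain:
        raise ValueError("ask at least one why")
    return chain[-1].answer


# The five whys from the example on the slide.
chain = [
    Why("Why was the website down?",
        "The CPU utilization on all our front-end servers went to 100%"),
    Why("Why did the CPU usage spike?",
        "A new bit of code contained an infinite loop"),
    Why("Why did that code get written?",
        "So-and-so made a mistake"),
    Why("Why did his mistake get checked in?",
        "He didn't write a unit test for the feature"),
    Why("Why didn't he write a unit test?",
        "He's a new employee, and he was not properly trained in TDD"),
]

for depth, step in enumerate(chain, start=1):
    print(f"{depth}. {step.question} {step.answer}")
print("Candidate root cause:", root_cause(chain))
```

Keeping every intermediate answer, not just the last one, makes it easier for a second investigator to see where the chain might have stopped at a symptom.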

Page 103: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Data Analysis and Assessment (Fishbone Diagram)

• Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram

▫ Developed by Kaoru Ishikawa in the 1960s

▫ One of the 7 basic quality management tools

• Can use with 5 Whys

▫ Put each why off the first tree point

▫ Ask why for each one of these issues

▫ Keep going until you find one or more root causes

• Some industries have common causes mapped to the fishbone

▫ Original 4 Ms – Machine, Method, Material, Man power

▫ The 8 Ps (Used in Service Industry) – People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product

▫ Ken’s List – People, Process, Tools, Accountability, Training, Recognition and awareness, Inspection and supervision, Pressure or Stress

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting
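Combining the fishbone with the 5 Whys can be sketched as a mapping from category "bones" to chains of whys, where the last entry in each chain is a candidate root cause. The `Fishbone` class below is a hypothetical illustration, and the choice of which category each cause falls under is an assumption made for the example, using Ken's List as the bones.

```python
from typing import Dict, List, Tuple

# Ken's List of common cause categories, as given on the slide.
KENS_LIST = [
    "People", "Process", "Tools", "Accountability", "Training",
    "Recognition and awareness", "Inspection and supervision",
    "Pressure or Stress",
]


class Fishbone:
    """A problem (the fish head) with one bone per cause category."""

    def __init__(self, problem: str, categories: List[str]) -> None:
        self.problem = problem
        self.bones: Dict[str, List[List[str]]] = {c: [] for c in categories}

    def add_cause(self, category: str, *whys: str) -> None:
        """Hang one chain of whys off a bone; the last why is the deepest."""
        if category not in self.bones:
            raise KeyError(f"unknown category: {category}")
        if not whys:
            raise ValueError("a cause needs at least one why")
        self.bones[category].append(list(whys))

    def root_causes(self) -> List[Tuple[str, str]]:
        """Return (category, deepest why) for every recorded chain."""
        return [(category, chain[-1])
                for category, chains in self.bones.items()
                for chain in chains]


# Assumed categorization of the website-down example from the Five Whys slide.
fb = Fishbone("Website was down", KENS_LIST)
fb.add_cause("Process",
             "Mistake got checked in",
             "No unit test was written for the feature")
fb.add_cause("Training",
             "New employee",
             "Not properly trained in TDD")

for category, cause in fb.root_causes():
    print(f"{category}: {cause}")
```

A diagram with several populated bones is a reminder that a single incident can have more than one root cause.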

Page 104: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Trending Per-Subcomponent

• Trends Matter

▫ Uptick Warrants More Investigation?

▫ Perform a Traditional RCA for That Set of Events

• Profile

▫ The State of the Code

▫ Personal Improvements

▫ Identify Key Events

Last 5 Weeks
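One way to turn "an uptick warrants more investigation" into something checkable is to compare the latest weekly bug count for each subcomponent against the average of the preceding weeks. The sketch below is an assumption-heavy illustration: the 1.5x threshold, the subcomponent names, and the counts are all made up for the example and are not taken from the Windows data.

```python
from statistics import mean
from typing import Dict, List


def upticks(weekly_counts: Dict[str, List[int]], ratio: float = 1.5) -> List[str]:
    """Return subcomponents whose latest week exceeds ratio x the earlier average.

    Each list holds bug counts per week, oldest first. The ratio is an
    arbitrary threshold chosen for illustration.
    """
    flagged = []
    for name, counts in weekly_counts.items():
        if len(counts) < 2:
            continue  # not enough history to call anything a trend
        *history, latest = counts
        if latest > ratio * mean(history):
            flagged.append(name)
    return flagged


# Hypothetical counts for the last 5 weeks.
last_five_weeks = {
    "setup": [4, 5, 3, 4, 9],
    "sync": [6, 6, 7, 5, 6],
}

print(upticks(last_five_weeks))
```

A flagged subcomponent is a prompt to run a traditional RCA on that set of events, not a conclusion in itself.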

Page 105: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Analysis is not yet the solution

• Five Whys and Fishbone Diagram help get to root causes

• Data and trending can provide timely alerts and catch regressions

• Root causes are then analyzed for corrective actions

Page 106: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase III: Analysis is not the solution (Fishbone Diagram)

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

• Five Whys and Fishbone Diagram are tools to get to root causes

• Data and trending of bugs can provide timely alerts and catch regressions

• Root causes are then analyzed for corrective actions

Page 107: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase IV: Corrective Actions

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

• Identify Trends and Group Them into Corrective Themes

▫ May be solutions related to Fishbone Diagram mapping buckets

• Meet with the experts again

▫ Remember my warning not to burn out your experts

• Determine Prioritization Factors and Costing for Corrective Actions

▫ Consider Return on Investment (ROI): you should have captured direct cost and opportunity cost during Data Collection

▫ Speed to implement

▫ Likelihood of solution being highly effective

▫ Simplicity of solution

▫ Is the solution automatable or process driven
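The prioritization factors above can be combined into a rough ranking of candidate corrective actions. The sketch below is one possible scoring scheme under stated assumptions: the weights in `score`, the field names, and every number in the example are invented for illustration, and a real program would pick its own factors and weights.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectiveAction:
    name: str
    cost_avoided: float          # direct + opportunity cost from Data Collection
    cost_to_implement: float
    weeks_to_implement: int      # speed to implement
    likelihood_effective: float  # 0.0 to 1.0
    simplicity: int              # 1 (complex) to 5 (simple)
    automatable: bool            # automatable vs. process driven

    @property
    def roi(self) -> float:
        """Simple ROI: net benefit divided by implementation cost."""
        return (self.cost_avoided - self.cost_to_implement) / self.cost_to_implement


def score(action: CorrectiveAction) -> float:
    """Hypothetical weighted score; the weights are arbitrary assumptions."""
    return (action.roi * action.likelihood_effective
            + action.simplicity / 5
            + (0.5 if action.automatable else 0.0)
            + 1 / action.weeks_to_implement)


def prioritize(actions: List[CorrectiveAction]) -> List[CorrectiveAction]:
    """Highest score first."""
    return sorted(actions, key=score, reverse=True)


# Made-up candidates and figures, for illustration only.
candidates = [
    CorrectiveAction("Retrain team on data sanitization",
                     50_000, 20_000, 8, 0.5, 3, False),
    CorrectiveAction("Block internet access from stress tests",
                     50_000, 5_000, 1, 0.9, 5, True),
]

for action in prioritize(candidates):
    print(action.name)
```

The point of a score like this is to make the trade-offs explicit for the management review, not to replace judgment about which fix to fund.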

Page 108: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Bug Wallow #3: Our Corrective Actions

• Email and Provisioning used Production Data

• Both sanitized the data

• Both impacted production

• What did we change?
▫ Stress Tests have no Internet Access
▫ Sanitized Date Diff feature

Page 109: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase V: Inform and Apply

• Host a Management Review

▫ Managers will like RCA more than bugs

▫ You are eliminating a problem not just finding it

• Implementation is a project, treat it that way

▫ Assign Owners

▫ Build and Maintain Schedule

▫ Create a Feedback Loop

▫ Establish a Monthly Status Report

▫ Track and correct the corrective action

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

Page 110: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Phase VI: Follow-up, Measurement, and Reporting

• More than Just
• Six Sigma type approaches
• Longitudinal Analysis
▫ Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫ Study Over Time
• Develop failure types and risk areas/components
• Inspect similar products/areas for baseline
• Gather and inspect process data
• Examine Data for Trends
• Report out

RCA process diagram: Event Identification → Data Collection → Data Analysis and Assessment → Corrective Actions → Inform and Apply → Follow-up, Measurement, and Reporting

Page 111: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Flatonium 2007

• Need to insert video

• 20 new machines added to the data center

• 5 machines put into production early

• Machines needed to be Nuked-N-Paved (NNP)

• Oops

Page 112: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

RCA Pit and Pendulum

Page 113: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Risks of Root Cause Analysis

• Begins with inadequate data

• Go after too much data too early

• Draws incorrect conclusions or makes invalid recommendations
▫ Anyone experience this before?

• Focus on the wrong set of defects

• Ends at the wrong level – too early or late

• Investment is not always predictable
▫ Can be high cost with low ROI

• Over focus on data can detract from the story

Page 114: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

Benefits of Structured RCA Study

• Can start as small pilots

• Uses an identical process regardless of type, age or scope of defect

• Avoids repeat failures

• Can be the shortest path to determining and correcting causes of failure

• Lowers Maintenance Costs

• Builds a culture of

▫ Accountability

▫ Continuous Improvement

Page 115: Ken Johnston - Big Bugs That Got Away -  EuroSTAR 2010

I’ve had enough