DESCRIPTION
EuroSTAR Software Testing Conference 2010 presentation on Big Bugs That Got Away by Ken Johnston . See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
What We Can Learn from Big
Bugs that Got Away
Ken Johnston, Group Manager, Office Internet Platforms & Operations
EuroSTAR 2010
I Want to know more about YOU
• Who wandered in here by accident
• Who is at EuroSTAR for the first time
• How long have you been in Software Testing
• Have you ever missed a bug
• Have you ever heard…
“HOW COULD
YOU MISS
THAT BUG!!!”
Def. – Rolling around in something disgusting
Ken’s Big
Bug Story
It all began one dark and stormy night!
Session Overview
• About you, me and setting the tone
• Bug Wallowing 1 – A self-reflective journey
• Bug Wallowing 2 – Group Therapy
• Root Cause Analysis 101
▫ Sentinel Events
▫ Pattern Analysis
▫ Formal RCA program overview
• Bug Wallowing 3
• Five Whys
• Bug Wallowing 4
• Fishbone
• Bug Wallowing 5
• Crafting a good bug story
Learning Objectives
1. Be armed to deal with the question, “How did test miss this bug.”
2. Learn a little about formal RCA and the use of the 5 Whys and Fishbone tools
3. Have a number of highly instructive bug stories from within your organization that you can take home
Def. – Roll in something: to lie down and roll around in something
“HOW COULD
YOU MISS
THAT BUG!!!”
Time for some “Group Bug” Therapy
Repeat After Me
• I did not design the bug.
• I did not code the bug.
• I found crashing bugs, data corruption bugs, fit and finish bugs.
• I found hundreds of bugs.
Repeat After Me
• So what if I missed a bug.
• I didn’t write the bug in the first place.
Activity Share your Bug Story
• Take the next 10 minutes
• Groups of 2 or 3
• Think of a bug that got away
• Minimum One Bug story each
• Questions to ask
▫ How long after ship did you see this
▫ How big was the impact
▫ How did it get missed
▫ What did you change because of this bug
That’s Time
Time to Share
• Next 5 minutes or so
• Did you have any Ah Ha moments?
Why do we Wallow in Bugs that got away?
• Take 3-5 minutes to discuss in your groups
That’s time
Time to Share
• What did you come up with?
• Why do we wallow?
• Why do we RCA bugs?
• My List
▫ To learn from mistakes
▫ To systematically identify areas for improvement
▫ To prevent repetition of mistakes
▫ Bugs are stories and organizations are driven by the stories they tell
First we need a common baseline to work from
Root Cause Analysis 300 Level
• Two approaches to RCA
▫ Sentinel Event
▫ Pattern Analysis
• Formal RCA Program
▫ Data Collection
▫ Data Analysis and Assessment
▫ Corrective Actions
• The Pit and the Pendulum
▫ Risks of RCA
▫ Benefits of RCA
Based upon Ch. 11. PDF available to EuroSTAR attendees: http://defectprevention.org
RCA – Sentinel Event Bugs
• How do you know it’s a Sentinel Event Bug?
• If you make the front page of the http://wsj.com
• Production Outage
▫ I have a lot of these stories
• Security vulnerabilities
• The last bug taken before ship
▫ “How could we have missed this!”
• Any big bug that got away
• Nothing to do with the X-Men
RCA Pattern Analysis
• Pattern Analysis requires a lot of bugs
• Pattern Analysis can be done over time
• Pattern Analysis is best served within a formal RCA Program.
▫ Cut some of the slides from this presentation
▫ The full set of slides can be found in the appendix on the EuroSTAR conference website
Phases of an RCA Program
1. Event Identification
2. Data Collection
3. Data Analysis and Assessment
4. Corrective Action
5. Inform and Apply
6. Follow-up, measurement and reporting
Phase diagram: Event Identification, Data Collection, Data Analysis and Assessment, Corrective Actions, Inform and Apply, Follow-up, Measurement, and Reporting
Phase 2: Data Collection Exercise
• Data Channels 5 Minute Discussion in Groups
▫ What are the sources of data in my organization
▫ Which are practical
▫ Which are the most costly to implement
▫ Which are most likely to yield results
▫ Do you have time to implement these
That’s time
Phase 2: Data Collection Time to Share
• What sources did you come up with?
Phase 2: Data Collection (Sources of Data)
• Defect and Test Case Management tracking system
• Source code repository and Test code coverage data
• Voice of the Customer
▫ Product support and Customer or marketing data
▫ Individual surveys and interviews
• Findings from previous RCA Studies
• Crash data through Windows Error Reporting
• Services have tickets and data center telemetry
▫ Heuristic Data of live site now vs. historic
More about WER @ https://winqual.microsoft.com/
Phase 2: Data Collection (Tracking System)
• Prepare a list of Sentinel Events
• Gather and Prepare the Preliminary Data
• Route Single Event through Process
• Create an RCA Tracking Database
Data Elements of RCA Tracking System
• Event or Study ID, Title & Dates
• Related Defect links
• Failure areas and Source Code
• Timeline of events before and after (vital for services)
• Team Contacts and Owners
• RCA Analysts and Contacts
• Expert Groups and Contacts
• Cause of defect and corrective action
• Survey Data and Results on effectiveness of corrective action
• Log Events in RCA system
• Analyze events
• NOTE: Meta Data better suited for lists, documents and shares
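The data elements above can be sketched as a single record type. This is a minimal illustration, not the tracking system used in the talk: the class name `RcaEvent`, every field name, and the sample values are assumptions chosen to mirror the bullet list.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class RcaEvent:
    """One row in a hypothetical RCA tracking database."""
    event_id: str                                             # Event or Study ID
    title: str
    opened: date
    defect_links: List[str] = field(default_factory=list)     # related defects
    failure_areas: List[str] = field(default_factory=list)    # areas / source code
    timeline: List[str] = field(default_factory=list)         # before and after
    team_contacts: List[str] = field(default_factory=list)
    rca_analysts: List[str] = field(default_factory=list)
    expert_groups: List[str] = field(default_factory=list)
    cause: Optional[str] = None
    corrective_action: Optional[str] = None
    survey_results: Dict[str, int] = field(default_factory=dict)

    def is_analyzed(self) -> bool:
        """True once both a cause and a corrective action are logged."""
        return self.cause is not None and self.corrective_action is not None


# Sample values are invented for illustration only.
event = RcaEvent("RCA-001", "Example sentinel event", date(2010, 1, 15))
event.timeline.append("Before: deployment started")
print(event.is_analyzed())  # False until a cause and corrective action are logged
```

Keeping the cause and corrective action optional lets an event be logged first and analyzed later, which matches the "log events, then analyze events" order on the slide.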
Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)
Good article from ASQ – http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html
Phase III: Data Analysis and Assessment (the Five Whys)
• Brief History - http://en.wikipedia.org/wiki/5_Whys
▫ Developed by Sakichi Toyoda
▫ First used in Toyota Motor Corporation
▫ Common tool within Kaizen, Lean Manufacturing & Six Sigma
• What is it
▫ Simply put - ask why 5 times to get to the root cause of a problem
• Fun Example from - http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫ Why was the website down? The CPU utilization on all our front-end servers went to 100%
▫ Why did the CPU usage spike? A new bit of code contained an infinite loop!
▫ Why did that code get written? So-and-so made a mistake
▫ Why did his mistake get checked in? He didn't write a unit test for the feature
▫ Why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
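The website example can be written down as a simple chain of question and answer pairs. This is only a sketch of how a team might record a Five Whys session; the function names `root_cause` and `render` are hypothetical, while the questions and answers are taken from the example above.

```python
from typing import List, Optional, Tuple

Why = Tuple[str, str]  # (question, answer)

WEBSITE_DOWN: List[Why] = [
    ("Why was the website down?",
     "The CPU utilization on all our front-end servers went to 100%"),
    ("Why did the CPU usage spike?",
     "A new bit of code contained an infinite loop"),
    ("Why did that code get written?",
     "So-and-so made a mistake"),
    ("Why did his mistake get checked in?",
     "He didn't write a unit test for the feature"),
    ("Why didn't he write a unit test?",
     "He's a new employee, and he was not properly trained in TDD"),
]


def root_cause(chain: List[Why]) -> Optional[str]:
    """The answer to the last 'why' is the candidate root cause."""
    return chain[-1][1] if chain else None


def render(chain: List[Why]) -> str:
    """Indent each level one step deeper so the drill-down is visible."""
    lines = []
    for depth, (question, answer) in enumerate(chain):
        lines.append("  " * depth + question + " " + answer)
    return "\n".join(lines)


print(render(WEBSITE_DOWN))
print("Candidate root cause:", root_cause(WEBSITE_DOWN))
```

The last answer is labeled a candidate on purpose: a chain only goes as deep as the investigator's knowledge allows.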
Def. – indulge in something excessively: to take pleasure or be immersed in something in a self-indulgent way
Insert Bug Story Videos
Five Whys Exercise
• Take 5-10 minutes
• Use one of these bugs or one of your own
• Try the five whys and see if you can find a root cause
That’s time
One does not worry
about grace or dignity
Time to Share
• Time for about 2 examples
• What about the 5 Whys worked for you
• Where did it fall short?
Phase III: Data Analysis and Assessment (the Five Whys)
• Criticism of five whys
▫ Not reproducible across individuals
▫ Shown that investigators tend to stop at symptoms rather than root cause
▫ Relies upon the investigator’s knowledge
• Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram
▫ Developed by Kaoru Ishikawa in the 1960s
▫ One of the 7 basic quality management tools
• Can use with 5 whys
▫ Put each why off the first tree point
▫ Ask why for each one of these issues
▫ Keep going until you find one or more root causes
• Some industries have common causes mapped to the fishbone
▫ Original 4 Ms – Machine, Method, Material, Man power
▫ The 8 Ps (Used in Service Industry) – People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product
▫ Ken’s List – People & Training, Tools, Inspection and supervision, Pressure or Stress, Process & Accountability, Recognition & Awareness
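A fishbone combined with the Five Whys can be sketched as categories (the bones) that each hold causes, with a chain of whys hanging off every cause. This is a minimal sketch: the `Fishbone` class and its methods are hypothetical, the categories come from Ken's List, and the sample problem and causes are invented placeholders rather than anything from the talk.

```python
from typing import Dict, List, Sequence, Tuple

KENS_LIST = [
    "People & Training",
    "Tools",
    "Inspection & Supervision",
    "Pressure or Stress",
    "Process & Accountability",
    "Recognition & Awareness",
]


class Fishbone:
    """Problem at the head, one bone per category, causes on each bone."""

    def __init__(self, problem: str, categories: Sequence[str]) -> None:
        self.problem = problem
        self.bones: Dict[str, List[dict]] = {c: [] for c in categories}

    def add_cause(self, category: str, cause: str, whys: Sequence[str] = ()) -> None:
        if category not in self.bones:
            raise KeyError(f"unknown category: {category}")
        self.bones[category].append({"cause": cause, "whys": list(whys)})

    def candidate_root_causes(self) -> List[Tuple[str, str]]:
        """Deepest 'why' on each branch, or the cause itself if none was asked."""
        found = []
        for category, entries in self.bones.items():
            for entry in entries:
                deepest = entry["whys"][-1] if entry["whys"] else entry["cause"]
                found.append((category, deepest))
        return found


# Placeholder problem and causes, for illustration only.
fb = Fishbone("A big bug that got away", KENS_LIST)
fb.add_cause("Tools", "No automated test for the scenario",
             whys=["Automation was deferred", "No owner for the test area"])
fb.add_cause("Pressure or Stress", "Fix taken late in the milestone")
for category, cause in fb.candidate_root_causes():
    print(f"{category}: {cause}")
```

Rejecting unknown categories keeps every cause on a named bone, which is the point of using a fixed list such as the 4 Ms, the 8 Ps, or Ken's List.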
Phase III: Data Analysis and Assessment (Fishbone Diagram)
Pressure or Stress
Recognition & Awareness
Process & Accountability
Tools
Inspection & Supervision
People & Training
Brownout across 3 largest datacenters
• Deployment tool changes
▫ Warn but do not prevent multi-DC deployments
▫ Automatically generate rollback script
▫ Cross service monitors will cancel and roll back a bad deployment automatically
• Process changes
▫ Deployment code review
▫ Deployment checklist
▫ Audits and Fire drills
Audited all alerts, escalation aliases and contact #s
Fire drill email and phone
• New Tools
▫ Per-Alert fault injection
• Recognition
▫ SWAT DRI team for most senior DRIs
Fishbone Exercise
• Take 5-10 minutes
• Have a handout for you
• Use the same bug from the five whys exercise
That’s time
Time to Share
• Time to share
▫ Who did the same bug as the five whys?
▫ Who did a different bug?
• What about the fishbone worked for you?
• Where did it fall short?
Phase III: Data Analysis and Assessment (the Fishbone)
• Criticism of Fishbone
▫ Requires a lot of experts for each branch
▫ Cumbersome
Phase V: Inform and Apply
• Host a Management Review
▫ Managers will like RCA more than bugs
▫ You are eliminating a problem not just finding it
• Implementation is a project, treat it that way
▫ Assign Owners
▫ Build and Maintain Schedule
▫ Create a Feedback Loop
▫ Establish a Monthly Status Report
▫ Track and correct the corrective action
Phase VI: Follow-up, Measurement, and Reporting
• More than Just
• Six Sigma type approaches
• Longitudinal Analysis
▫ Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫ Study Over Time
• Develop failure types and risk areas/components
• Inspect similar products/areas for baseline
• Gather and inspect process data
• Examine Data for Trends
• Report out
Def. – have huge amount of something: to have an ample or excessive supply of something
RCA Pit and Pendulum
Risks of Root Cause Analysis
• Begins with inadequate data
• Go after too much data too early
• Draws incorrect conclusion or makes invalid recommendations
▫ Anyone experience this before?
• Focus on the wrong set of defects
• Ends at the wrong level – too early or late
• Investment is not always predictable
▫ Can be high cost with low ROI
• Over focus on data can detract from the story
Benefits of Structured RCA Study
• Can start as small pilots
• Uses an identical process regardless of type, age or scope of defect
• Avoids repeat failures
• Can be the shortest path to determining and correcting causes of failure
• Lowers Maintenance Costs
• Builds a culture of
▫ Accountability
▫ Continuous Improvement
Achieve Balance
• Full-blown RCA with large pattern analysis rarely meets ROI goals.
• Limit the scope
▫ Few Data Sources
▫ Beware of the RCA Tax
• Focus on Sentinel Events
▫ Provides opportunity for clear visible wins
▫ If it’s a bug that got away you’ll be doing a Post Mortem anyway
▫ Sentinel events provide an opportunity to change the dialogue
I’ve had enough
Telling a Tall Tale
So why a focus on Bugs that got away
• Bugs that got away are Sentinel Events
• They are great stories
▫ There is never an end to bugs
• Bug Stories are Organizational Knowledge
• Tribal Knowledge drives organizations
• Stories are powerful change enablers
Stories Work!
Biographies
Allegories
Gloves on the Boardroom Table
• The Heart of Change
▫ Requires an emotional component
▫ What is more emotional than “How could test miss this bug!”
• Not all change stories involve yelling
• Visual and tactile help too
▫ Handout of “Gloves on the boardroom table”
▫ [email protected] “I love your idea. And you have my permission.”
Organizational Development
• I worked in Engineering Excellence
▫ We were a Performance Improvement organization
▫ Enterprise Change Management
• Let me bring in some OD concepts
Knowledge Management (KM)
comprises a range of practices used in an organization to identify, create, represent, distribute and enable adoption of insights and experiences.
Such insights and experiences comprise knowledge, either embodied in
individuals or embedded in
organizational processes or practice.
http://en.wikipedia.org/wiki/Knowledge_management
What are Organizations Made of?
PEOPLE
What do people do?
Talk about stuff
Tribal Knowledge
Institutional memory is a collective set of
facts, concepts, experiences and know-how held
by a group of people.
http://en.wikipedia.org/wiki/Institutional_memory
Organizational Storytelling
The study of organizational storytelling, sometimes called “Narrative Knowledge,” attempts to recount events in the form of a story within the context of an organization
http://en.wikipedia.org/wiki/Organizational_Storytelling
So, what is a bug story?
Something that should… be part of the Organizational Narrative Knowledge
Springboard Story
• Very simple, very quick, very brief
▫ Think elevator ride
• Non-threatening
• Enables listener to visualize
• Catalyzes understanding
• Spark new stories in the mind
• Do not transfer large amounts of information
Story Telling Tips
• Brains are not computers
▫ Brain Movies – “The brain assembles perceptions by the simultaneous interaction of whole concepts, whole images.”
• The Central Movie – a country or organization
▫ Universal Principles – freedom, democracy, constitutional government
▫ Long-term goals – education, “life, liberty, pursuit of happiness”
▫ Operating methods – free markets, due process, federal and state governments
• Capture the Audience
▫ “One time there was this bug we missed…”
• 3D Story Telling pg 85-87
▫ Details (facts, information)
▫ Dialogue (characters)
▫ Drama (a bug that got away?)
Brain Movies, The Central Movie, and 3D Story Telling from “The Leader’s Voice”
Our Last Exercise!
• Your own bug story in 10 minutes
▫ Take 10 minutes outlining your story
▫ Goal is a 1-2 minute story: think short and tight
• Remember to
▫ Hook the audience
▫ 3D Storytelling – Details, Dialogue, Drama
▫ RCA – what change do you want to convey?
My Bug Story - Template
• Title
• The Hook
• Details – Who, what, when, product/project
• Dialogue – Yelling, Crying, Funny?
• Drama – What is the tension? Anyone Fired?
• What were the Root Causes
• What did you change and why?
That’s lunch time
Time to Share
• 3 volunteers to come up and tell their bug story
Resources
• “The Leader’s Guide to Storytelling” by Steve Denning
▫ Resources – http://www.stevedenning.com/launchgifts.html
▫ Audio Interview - The knowledge-based organization: Using stories to embody and transfer knowledge http://www.storytellingwithchildren.com/2008/01/12/steve-denning-the-knowledge-based-organization/
• “The Leader’s Voice” by Crossland & Clarke
▫ http://roncrossland.com/
• Defect Prevention Chapter 11 RCA
▫ http://defectprevention.org
• “The Heart of Change” by Dr. John P. Kotter
▫ Gloves story can be found on pages 11-12 http://www.linkageinc.com/pdfs/disl/KotterPG.pdf
http://www.hwtsam.com
http://blogs.msdn.com/kenj
http://twitter.com/rkjohnston
Chapter 14 (Software + Services Testing) from “How We Test Software at Microsoft” provided on conference CD courtesy of Microsoft Press
Ken Johnston – Microsoft STARWest 2009 Tutorial TJ
What We Can Learn from Big Bugs that Got Away
Appendix
• What follows are a series of slides to teach RCA.
• Some of the slides are integrated in this tutorial on Bugs that Got Away but not all.
First we need a common baseline to work from
Root Cause Analysis 300 Level
• Two approaches to RCA
▫ Sentinel Event
▫ Pattern Analysis
• Formal RCA Program
▫ When to do an RCA Study
▫ Staffing for Success
▫ Phases of an RCA Study
• The Pit and the Pendulum
▫ Risks of RCA
▫ Benefits of RCA
Based upon Ch. 11 http://defectprevention.org
RCA Sentinel Event
A sentinel event is defined by the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) as any unanticipated event in a healthcare setting resulting in death or serious physical or psychological injury to a person or persons.
http://en.wikipedia.org/wiki/Sentinel_event
RCA – The Sentinel Event of Bugs
• Home Page of http://wsj.com
• Production Outage
▫ I have a lot of these stories
• Security vulnerabilities
• The last bug taken before ship
• “How could we have missed this!”
• Big Bugs that Got Away
RCA – Office 14 Sentinel Bug Process
• Why SharePoint as the repository
▫ Attachments
▫ Collaborating
▫ Workflow
▫ Reporting Dash
▫ Wiki
▫ Exchange contacts
▫ Offline
• Simple Light Weight Approach
• Focus on recall class bugs from O14 Beta 1
▫ Will need the answers anyway to get through triage
▫ Usually logged in the bug but not easy to find or learn from
▫ No consistent process across teams
• Develop a common template in Word
• Track on a SharePoint site with some meta data
Office 14 Root Cause “Template”
• Tenets/Best Practices
• History/Summary
• Bugs
▫ Bug number(s)
▫ Bug description
• Root Cause Questions
▫ Would this get found in our Test Focus/Pass for this area?
▫ When did it get broken?
▫ Was ownership confused?
▫ Would we have assumed that another team would have also seen it?
▫ Would it have been reasonable to assume that the fix that caused the regression would have broken this?
▫ Would a code review have likely identified the issue?
▫ Was there a partner team(s) involved?
▫ Were there multiple PRs involved?
▫ Was the feature "Hot" coming into the close of the milestone?
• Engineering Recommendations:
▫ Recommendation(s)/Owners
▫ 1.
▫ 2.
▫ 3.
O14 Example Beta1 End Game
• Word: Japanese Indented Bullets when saved lose their indents
▫ Repro:
Set Japanese to be your primary editing language
Create a bulleted list with indents
Save/Close/Re-open
Result: indents are gone
Expect: no loss of indents
▫ Happens with all docs created with that setting in 12 and 14
O14 Example RCA Recommendations
• Engineering Recommendation:
▫ Automate this case and use the code change to inform other automation needed for this area (lists, styles, paragraph props)
▫ Ensure that ICTs dogfood the product
▫ Make new push for testers to use international settings more frequently, with an eye on Beta2 languages and risks associated with each language equivalence class – we’ll most likely drive a Mini-pass on all our features with this setting for Beta2
▫ Add this area to testing executed during regression checks on all style-related fixes.
RCA Sentinel Bug Approach
• Big Bugs that got away are Sentinel Events
• One bug is indicative of other risk
• The more big bugs the more patterns
• Nothing to do with X-Men
Formal RCA Program(Sentinel Events and Pattern Analysis)
• Started at any time during SDLC
• Often launched after a single expensive bug
▫ Security vulnerabilities
▫ Production Outage (I have a lot of these stories)
• Can be Resource Intensive - so be deliberate
Staffing for Success – RCA Study Analyst
• A Single Analyst or a Team▫ Could be you after today
• Senior with wide range of development process knowledge
• Component Level and System Level analysis
• Work with all types - Development, Testing, Program Management, Operations, Support
▫ May include marketing and field personnel
• Skills
▫ Defect and low-level code analysis
▫ Efficiency Diagnosis
▫ RCA Analysis and event understanding
▫ Algorithm and metric development
▫ Data analysis and presentation
Phases of an RCA Program
1. Event Identification
2. Data Collection
3. Data Analysis and Assessment
4. Corrective Action
5. Inform and Apply
6. Follow-up, measurement and reporting
Phase I: Event Identification
• The Sentinel Event
▫ Bug that got away and customer found
▫ Does not need to be a defect
▫ One or multiple
• Often too many bugs to pick from
▫ For an RCA program first establish criteria for a sentinel event
Phase I: Event Identification (Sentinel Event Criteria)
• Not all bugs will yield a true “root” cause
• Focus on most severe/undesirable event
▫ “I remember this one bug…”
• Risk based assessment criteria
▫ Severity
▫ Risk of recurrence
▫ Cost – actual and opportunity
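The three risk-based criteria can be turned into a rough screening score. This is a sketch under stated assumptions: the 1 to 5 scales, the weights (5, 3, 2), the threshold of 40, and the names `Candidate`, `sentinel_score` and `pick_sentinel_events` are all hypothetical, not values from the talk. Each team would set its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    title: str
    severity: int          # 1 (low) to 5 (high); assumed scale
    recurrence_risk: int   # 1 to 5; assumed scale
    cost: int              # 1 to 5; actual plus opportunity cost, assumed scale


def sentinel_score(c: Candidate, weights: Tuple[int, int, int] = (5, 3, 2)) -> int:
    """Weighted sum of the three criteria; the maximum is 50 with default weights."""
    w_sev, w_rec, w_cost = weights
    return w_sev * c.severity + w_rec * c.recurrence_risk + w_cost * c.cost


def pick_sentinel_events(candidates: List[Candidate], threshold: int = 40) -> List[Candidate]:
    """Return candidates at or above the threshold, highest score first."""
    ranked = sorted(candidates, key=sentinel_score, reverse=True)
    return [c for c in ranked if sentinel_score(c) >= threshold]


# Invented candidates for illustration.
bugs = [
    Candidate("Cosmetic glitch in dialog", severity=2, recurrence_risk=2, cost=1),
    Candidate("Production outage", severity=5, recurrence_risk=4, cost=5),
]
for c in pick_sentinel_events(bugs):
    print(c.title, sentinel_score(c))
```

Integer weights keep the arithmetic easy to audit by hand. The score is only a filter for "too many bugs to pick from", not a replacement for judgment.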
Sentinel Event Data Channel Loop
Diagram boxes: Identify Sentinel Event Criteria, Identify Data Channels, Route Single Event through Process, Prepare Data & Map Fields (defect tracking system query), Log Event in RCA Tracking Database, Event to Analyze
Phase I: Event Identification (Data Channel – Sources of Data)
• Defect and Test Case Management tracking system
• Source code repository and Test code coverage data
• Voice of the Customer
▫ Product support and Customer or marketing data
▫ Individual surveys and interviews
• Findings from previous RCA Studies
• Crash data through Windows Error Reporting
• Services have tickets and data center telemetry
▫ Client and Cloud testing session tomorrow
More about WER @ https://winqual.microsoft.com/
Phase I: Event Identification (Tracking System)
• Prepare a list of Sentinel Events
• Gather and Prepare the Preliminary Data
• Route Single Event through Process
• Create an RCA Tracking Database
Data Elements of RCA Tracking System
• Event or Study ID, Title & Dates
• Related Defect links
• Failure areas and Source Code
• Timeline of events before and after (vital for services)
• Team Contacts and Owners
• RCA Analysts and Contacts
• Expert Groups and Contacts
• Cause of defect and corrective action
• Survey Data and Results on effectiveness of corrective action
• Log Events in RCA system
• Analyze events
• NOTE: Meta Data better suited for lists, documents and shares
Phase II: Data Collection
• Use Common Sense and Trust Gut Feel
▫ “Hey did you hear about the bug…”
▫ “I heard BillG was doing a demo when…”
• Use a survey to gather additional data
▫ Was this noticed and ignored
▫ Is this a common error type
▫ Could this have been prevented
• Gather common data on several sentinel events
Phase II: Data Collection
• Windows Customized (Visual Studio Team System)
▫ Part of Defect Tracking System
▫ Connect to source code
▫ Attachments
▫ Collaborating
▫ Workflow
Windows ezRCA Program
The Goal
• Reduce Defects Throughout the Product Cycle
The Questions
• What type of defect?
• What phase was the defect introduced?
• What was the extent of the fix?
• How long did it take to fix the defect?
The Source
• Product Studio Extension (Per Bug Report)
Leverage Points
• Distributed Workflow
• Quick and Easy Data Collection
• Aggregate Analysis and Trend Charts
• Subcomponent-Level Data Also Available
• Focus on Individual Improvement
Windows “ezRCA” Approach
• Windows Vista ran a full RCA program
• Windows 7 moved to ezRCA
▫ Cut many of the other data sources
▫ Focus on meta data around bugs
Windows EZ RCA Diagnosis
As is
• Diagnosis is currently required for all bugs and defaults to NA
New
• This field should only be activated if the bug is resolved “Fixed” or “Won’t Fix”
• There should be no default value
• Change/combine Hardware & No HW to Hardware Issue
NOTE: Items in RED are new or changed
Diagnosis values: Assignment Error, Build Error, Concurrency Error, Data Checking Error, Data Corruption, Doc Error, Environment Error, Error Handling Problem, Hardware Issue, Ignored Failure, Incorrect Program State, Interface Error, Missing Method/Function, Logic Error, Not Applicable, Other, Resource Issue, Simple Coding Error, System Error, User Misunderstanding
Windows ezRCA Values
• Initial classification of root causes
• Root cause helps us identify the nature of the kinds of mistakes we are making
• This will be a required field for Developers when resolving a bug that is ‘Fixed’ or ‘Won’t Fix’
• This will be a single-select dropdown list and developers will be expected to select the item that is most applicable
• This field is not intended to replace deep RCA studies and more information will likely be required based on analysis of this data
• For gathering further information, use the Prevention Tab, Test Follow-up Tab, and Bug Analysis Tabs in Product Studio or Soapbox (NOTE: Much of this will be consolidated in the future)
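The field rules described above (single-select, no default, required only when a bug is resolved Fixed or Won't Fix) can be sketched as a small validation function. This is not Product Studio's actual behavior or API: the function name `validate_diagnosis`, the error messages, and the "Duplicate" resolution used in the example are assumptions. The value list is the one shown on the diagnosis slide.

```python
from typing import Optional

DIAGNOSIS_VALUES = (
    "Assignment Error", "Build Error", "Concurrency Error",
    "Data Checking Error", "Data Corruption", "Doc Error",
    "Environment Error", "Error Handling Problem", "Hardware Issue",
    "Ignored Failure", "Incorrect Program State", "Interface Error",
    "Missing Method/Function", "Logic Error", "Not Applicable", "Other",
    "Resource Issue", "Simple Coding Error", "System Error",
    "User Misunderstanding",
)

RESOLUTIONS_REQUIRING_DIAGNOSIS = {"Fixed", "Won't Fix"}


def validate_diagnosis(resolution: str, diagnosis: Optional[str]) -> Optional[str]:
    """Return an error message, or None when the combination is acceptable."""
    if resolution in RESOLUTIONS_REQUIRING_DIAGNOSIS:
        if diagnosis is None:
            # No default value: the developer must pick one explicitly.
            return "Diagnosis is required for this resolution"
        if diagnosis not in DIAGNOSIS_VALUES:
            return "Unknown diagnosis value"
        return None
    if diagnosis is not None:
        # The field is only active for Fixed / Won't Fix.
        return "Diagnosis applies only to Fixed or Won't Fix"
    return None


print(validate_diagnosis("Fixed", "Logic Error"))   # None
print(validate_diagnosis("Fixed", None))            # Diagnosis is required for this resolution
print(validate_diagnosis("Duplicate", None))        # None
```

Returning a message instead of raising keeps the check easy to wire into a form that shows the error next to the field.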
Windows Additional RCA data
• Symptom and Prevention categorization
• Link to more info
• Anonymous submission
ezRCA Pivot Points
ezRCA
• Data on Lots of Bugs
• Few Questions & Answers
• Quick, Easy
• Fully Distributed
Traditional RCA
• Data on Select Fixed Bugs
• Detailed Analysis of Defect
• Multiple-Data Sources
• Significant Investment
• Can be Resource-Limited
Phase II: Data Collection Keys to Success
• For Sentinel Events open template is fine
• For ezRCA Extend bug tracking system with ezData Collection
▫ Keep system light weight
▫ Limit required fields
▫ Provide opportunity to expand within bug
• For Formal RCA will need multiple data sources and extensible schema
• Recommend you start with Sentinel Events and progress to a formal program
Keep going with formal RCA
• Some tools you can use with Sentinel Events and ezRCA
• What good tester doesn’t make you wallow in the details?
Phase III: Data Analysis and Assessment
• Analysis Performed by
▫ RCA Team
▫ Research Team
▫ Related experts
• Log all outputs in RCA System
• Be judicious with Experts’ time
Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)
Good article from ASQ – http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html
Phase III: Data Analysis and Assessment (the Five Whys)
• Brief History - http://en.wikipedia.org/wiki/5_Whys
▫ Developed by Sakichi Toyoda
▫ First used in Toyota (Kaizen), Six Sigma tool
• What is it
▫ Simply put - ask why 5 times to get to the root cause of a problem
• Fun Example from - http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫ Why was the website down? The CPU utilization on all our front-end servers went to 100%
▫ Why did the CPU usage spike? A new bit of code contained an infinite loop!
▫ Why did that code get written? So-and-so made a mistake
▫ Why did his mistake get checked in? He didn't write a unit test for the feature
▫ Why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
• Criticism of five whys
▫ Not reproducible across individuals
▫ Shown that investigators tend to stop at symptoms rather than root cause
▫ Relies upon the investigator’s knowledge
• Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram
▫ Developed by Kaoru Ishikawa in the 1960s
▫ One of the 7 basic quality management tools
• Can use with 5 Whys
▫ Put each why off the first tree point
▫ Ask why for each one of these issues
▫ Keep going until you find one or more root causes
• Some industries have common causes mapped to the fishbone
▫ Original 4 Ms – Machine, Method, Material, Man power
▫ The 8 Ps (Used in Service Industry) – People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product
▫ Ken’s List – People, Process, Tools, Accountability, Training, Recognition and awareness, Inspection and supervision, Pressure or Stress
Phase III: Data Analysis and Assessment (Fishbone Diagram)
Trending Per-Subcomponent
• Trends Matter
▫ Uptick Warrants More Investigation?
▫ Perform a Traditional RCA for That Set of Events
• Profile
▫ The State of the Code
▫ Personal Improvements
▫ Identify Key Events
Last 5 Weeks
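The "uptick warrants more investigation" idea can be sketched as a per-subcomponent comparison of the latest week against the earlier weeks in a five-week window. The rule used here (latest week more than twice the prior average, with a floor of one bug per week) is an assumption for illustration, as are the function names and the sample data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Bug = Tuple[int, str]  # (week number, subcomponent)


def weekly_counts(bugs: Iterable[Bug]) -> Dict[str, Counter]:
    """Count bugs per week for each subcomponent."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for week, subcomponent in bugs:
        counts[subcomponent][week] += 1
    return counts


def upticks(bugs: Iterable[Bug], latest_week: int,
            window: int = 5, factor: float = 2.0) -> List[str]:
    """Flag subcomponents whose latest week exceeds factor x the prior average."""
    flagged = []
    for subcomponent, per_week in weekly_counts(bugs).items():
        prior_weeks = range(latest_week - window + 1, latest_week)
        prior = [per_week[w] for w in prior_weeks]   # a Counter returns 0 for missing weeks
        baseline = max(sum(prior) / len(prior), 1.0)  # floor avoids flagging 0 -> 1
        if per_week[latest_week] > factor * baseline:
            flagged.append(subcomponent)
    return sorted(flagged)


# Invented data: "parser" jumps in week 5, "ui" stays flat.
sample = [(w, "parser") for w in (1, 2, 3, 4)] + [(5, "parser")] * 5
sample += [(w, "ui") for w in (1, 2, 3, 4, 5) for _ in range(2)]
print(upticks(sample, latest_week=5))  # ['parser']
```

A flagged subcomponent is only a prompt to run a traditional RCA on that set of events. The trend itself says nothing about the cause.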
Analysis is not yet at solutions
• Five Whys and Fishbone Diagram help get to root causes
• Data and trending can provide timely alerts and catch regressions
• Root causes are then analyzed for corrective actions
Phase III: Analysis is not the solution (Fishbone Diagram)
• Five Whys and Fishbone Diagram are tools to get to root causes
• Data and trending of bugs can provide timely alerts and catch regressions
• Root causes are then analyzed for corrective actions
Phase IV: Corrective Actions
• Identify Trends and Group Them into Corrective Themes
▫ May be solutions related to Fishbone Diagram mapping buckets
• Meet with the experts again
▫ Remember my warning not to burn out your experts
• Determine Prioritization Factors and Costing for Corrective Actions
▫ Consider Return on Investment (ROI): you should have captured direct cost and opportunity cost during Data Collection
▫ Speed to implement
▫ Likelihood of solution being highly effective
▫ Simplicity of solution
▫ Is the solution automatable or process driven
Bug Wallow #3: Our Corrective Actions
• Email and Provisioning used Production Data
• Both sanitized the data
• Both impacted production
• What did we change?
▫ Stress Tests have no Internet Access
▫ Sanitized Date Diff feature
Phase V: Inform and Apply
• Host a Management Review
▫ Managers will like RCA more than bugs
▫ You are eliminating a problem not just finding it
• Implementation is a project, treat it that way
▫ Assign Owners
▫ Build and Maintain Schedule
▫ Create a Feedback Loop
▫ Establish a Monthly Status Report
▫ Track and correct the corrective action
Phase VI: Follow-up, Measurement, and Reporting
• More than Just
• Six Sigma type approaches
• Longitudinal Analysis
▫ Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫ Study Over Time
• Develop failure types and risk areas/components
• Inspect similar products/areas for baseline
• Gather and inspect process data
• Examine Data for Trends
• Report out
Flatonium 2007
• Need to insert video
• 20 new machines added to the data center
• 5 machines put into production early
• Machines needed to be Nuked-N-Paved (NNP)
• Oops
RCA Pit and Pendulum
Risks of Root Cause Analysis
• Begins with inadequate data
• Go after too much data too early
• Draws incorrect conclusion or makes invalid recommendations
▫ Anyone experience this before?
• Focus on the wrong set of defects
• Ends at the wrong level – too early or late
• Investment is not always predictable
▫ Can be high cost with low ROI
• Over focus on data can detract from the story
Benefits of Structured RCA Study
• Can start as small pilots
• Uses an identical process regardless of type, age or scope of defect
• Avoids repeat failures
• Can be the shortest path to determining and correcting causes of failure
• Lowers Maintenance Costs
• Builds a culture of
▫ Accountability
▫ Continuous Improvement
I’ve had enough