Data-Driven Software Testing: The New, Lean Approach to Quality

BT14 Concurrent Session 11/13/2014 3:00 PM

"Data Driven Software Testing: The New, Lean Approach"

Presented by:

Ken Johnston Microsoft

Brought to you by:

340 Corporate Way, Suite 300, Orange Park, FL 32073 888‐268‐8770 ∙ 904‐278‐0524 ∙ [email protected] ∙ www.sqe.com

Ken Johnston is the principal engineering manager for the Microsoft Operating Systems Data Integration and Insights team. Since joining Microsoft in 1998, Ken has filled many other roles, including test lead, test manager, and group program manager. Recently he has worked on Bing Data Quality and Measurements, Cosmos (the Microsoft big data platform), and the Windows Apps Store. For two and a half years, Ken served as the Microsoft director of test excellence. He is a frequent speaker, blogger, and author on big data, software testing, and online services development. Contact Ken on Twitter @rkjohnston.

11/25/2014

Data Driven Software Quality: The Lean Approach to not Testing

Ken Johnston
Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://blogs.msdn.com/kenj
Email – [email protected]
LinkedIn – http://linkedin.com/in/rkjohnston

@rkjohnston #BSCADC

About Ken

Data Scientist in Data Driven Outcomes (D2O)

Office Live, WebApps, Office Online

Cosmos, AutoPilot, Local, Shopping

Next is Data Driven Quality

EaaSy – Everything as a Service “yes!”

Write Books and Blog on Occasion

This is a talk about Change, Big Change

• No Test Plans
• Fewer Test Cases
• Less Test Automation
• Releasing with Lower Initial Quality

About this Presentation

• Big Data + Agile are Driving Change
• Data Driven Quality Framework
• Analysis and Insights
• Minimum Viable Quality
• Designing for DDQ and Mitigating Risk
• Taking Action

Chart: “Big Data” Search Trends

Big Data and Agile

The coming changes are being driven by Big Data and the rate of product release

Massive amounts of Internal Data

Engineering Data
• Test pass/fail results
• Bug counts
• Code Complexity
• Code Coverage
• Code Churn
• Performance
• Reliability

But Opinion Still Reigns
• The HiPPO
• Highest Paid Person’s Opinion

Big Data Insights are Real

Improving Confidence Intervals

Here’s a Classic Story

• Cocoa butter lotion
• A large purse
• Zinc and magnesium supplements
• A bright blue rug

What would you do with this information?

But what if the expectant mother was an underage minor living at home?

Predictive Modeling is Real

Microsoft Bing Launches Predictions

Predictions Lab - https://www.prediction.microsoft.com/

2014 Elections with CNN - http://blogs.bing.com/search/2014/10/21/bing-and-ie-team-with-cnn-for-elections-2014/

Dancing With the Stars - http://blogs.bing.com/search/2014/09/17/who-is-going-to-win-dancing-with-the-stars/

Uses Social, Search, and Betting Websites

Big Data and Quality is Happening

Improving IE Quality of Experience in a Dynamic Web

The Problem: Sites break, bad experience for our customers

• Huge engineering investment to stay on top of the ever changing Web

• Traditionally evaluated through manual testing, bug reporting, and escalations

• Public telemetry unstructured, very poor signal‐noise ratio

Solution: Get more data!

• “Report website problems”, new feature added in April servicing release

• Enables users to provide semi‐structured issue reports to Microsoft—URLs are structured, comments are freeform

• Preserve user Privacy (URLs and min data)

Process and Normalize Data

We used a SARIMA model (Seasonal Autoregressive Integrated Moving Average):

Supports observed seasonality in weekday/weekend reports, and non-stationary mean as volume of reports increases over time.
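
A weekly weekday/weekend pattern and a mean that drifts upward are the two properties that the seasonal and integrated parts of a SARIMA model handle by differencing. The sketch below shows only that differencing step, not a full SARIMA fit. The daily report counts and the function name are invented for illustration and are not the team's actual data or pipeline.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical daily report counts for three weeks (Mon..Sun).
# Weekdays are busier than weekends, and volume grows by 10 each week.
week = [100, 110, 105, 108, 102, 40, 35]
reports = [count + 10 * w for w in range(3) for count in week]

# Seasonal differencing (lag 7) removes the weekday/weekend pattern.
seasonal = difference(reports, lag=7)

# A further first difference (lag 1) removes what is left of the steady growth.
stationary = difference(seasonal, lag=1)

print(seasonal)
print(stationary)
```

On real report volumes the differenced series would be noisy rather than flat, and the autoregressive and moving-average terms would then be fitted to that remainder.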

Introduction to Data Driven Quality Framework (DDQF)

Traditional Testing
• Test Planning
• Test Cases and Automation
• Test passes
• Defect Management
• Sign Off/Release

DDQF Cycle
• What could go Wrong
• Instrumentation
• Early Release
• Data Analysis
• Rolling Releases

DDQF cycle stages: Asking the Right Questions, Data Acquisition, Analysis, Take Action, Release

• DDQF is an iterative cycle

• Roots in DMAIC (Define, Measure, Analyze, Improve, Control)

• Less up front certainty and more iterative

• Release is the key to managing risk. Release is more than Control, it is also the break.

Asking the Right Questions

The Heart of the Tester: Ruthless Curiosity

Establishing the fundamental questions about quality is critical

• Mean Time Between Failures (MTBF)
• Launch Time
• Performance Metrics
• Hang Time
• Service Up Time and Availability
• Page Load Time
• Page Dwell Time
• Sessions per UU
• Engagement/Usage Time
• Feature Engagement
  o Discoverability
  o Return rate
• Click Through Rate (CTR)
• Hover Time
• Quick Back
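
Several of these metrics reduce to simple ratios over telemetry. The sketch below uses invented numbers. The function names and the definitions are common conventions assumed here, not definitions given in the talk: MTBF as operating time divided by failure count, availability as uptime over total time, and CTR as clicks over impressions.

```python
def mtbf_hours(operating_hours, failure_count):
    """Mean Time Between Failures: operating time per observed failure."""
    if failure_count == 0:
        return float("inf")  # no failures observed in the window
    return operating_hours / failure_count


def availability(uptime_minutes, downtime_minutes):
    """Fraction of the measured window during which the service was up."""
    total = uptime_minutes + downtime_minutes
    return uptime_minutes / total if total else 0.0


def click_through_rate(clicks, impressions):
    """Fraction of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0


# Hypothetical 30-day window: 3 failures, 30 minutes of downtime.
print(mtbf_hours(720, 3))
print(availability(43_170, 30))
print(click_through_rate(42, 1200))
```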

To Measure Customer Product Satisfaction you need Scenarios

What is a Scenario?

A scenario is a clear and succinct description of a specific experience and customer benefit that the product is designed to deliver.

It is a finite set of product aspects which can be measured and evaluated from the customer’s perspective.

Can I connect to a WiFi hotspot easily

• Discoverability

• Trust

• Negotiate the connection

Can I file my expense report

• Does the software know me and keep my profile

• Did it save my default currency

Scenario at a Glance

Diagram labels: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable; Time; Task Completion

Think about the User
• What does the user need to do
• What do they want to do
• How do alternative software products do it
• Can we take steps out of the way
• What would delight the User

Questions Drive Instrumentation

Diagram labels: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable

Instrument your Code
• Scenario Start
• Key Steps
• Lost Loops
• Scenario Complete
• Success Ratios
• Time to Task Completion
• Minutes of Usage
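
The instrumentation points above can be illustrated with a small in-memory event log. In this sketch the `ScenarioTracker` class, the event names, and the session ids are all hypothetical. A real product would send these events to a telemetry pipeline instead of a Python list.

```python
import time


class ScenarioTracker:
    """Collects in-memory events for one scenario (hypothetical helper)."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self.clock = clock
        self.events = []  # (timestamp, session_id, event_name)

    def log(self, session_id, event):
        self.events.append((self.clock(), session_id, event))

    def summary(self):
        starts, completes = {}, {}
        for ts, session, event in self.events:
            if event == "start":
                starts.setdefault(session, ts)
            elif event == "complete":
                completes.setdefault(session, ts)
        # Only sessions with both a start and a complete count as successes.
        durations = [completes[s] - starts[s] for s in completes if s in starts]
        return {
            "started": len(starts),
            "completed": len(durations),
            "success_ratio": len(durations) / len(starts) if starts else 0.0,
            "avg_time_to_complete": (
                sum(durations) / len(durations) if durations else None
            ),
        }


# Usage with a fake clock that ticks once per logged event.
ticks = iter(range(100))
tracker = ScenarioTracker("connect_wifi", clock=lambda: next(ticks))
tracker.log("s1", "start")         # scenario start
tracker.log("s1", "pick_network")  # key step
tracker.log("s2", "start")         # a second user starts but never finishes
tracker.log("s1", "complete")      # scenario complete
result = tracker.summary()
print(result)
```

Sessions that start but never complete, such as `s2` here, are what pull the success ratio down, and they are a natural place to look for the lost loops listed above.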

Data Acquisition

Two types of data to acquire
• Active = synthetic
• Passive = organic

Active for services only?
• Client: is the service there?

Staged Data Acquisition - Netflix

1B API requests per day

Canary Deployment
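
A canary deployment sends a small share of traffic to the new build and compares its telemetry with the existing fleet before going wider. The sketch below shows one possible comparison rule. The function name, the thresholds, and the numbers are assumptions for illustration and do not describe Netflix's actual tooling.

```python
def canary_is_healthy(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      max_ratio=1.5, min_requests=1000):
    """Return True when the canary error rate is within max_ratio of baseline.

    Returns None when there is not yet enough traffic to judge.
    """
    if canary_requests < min_requests or baseline_requests == 0:
        return None
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate * max_ratio


# Hypothetical counts: the baseline fleet has a 0.2% error rate.
print(canary_is_healthy(200, 100_000, 3, 2_000))   # canary at 0.15%
print(canary_is_healthy(200, 100_000, 12, 2_000))  # canary at 0.6%
print(canary_is_healthy(200, 100_000, 0, 50))      # too little canary traffic
```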

Staged Data Acquisition - Facebook

• Dogfood
• In prod, no users (except internal ones)
• Some servers in Production
• World-wide deployment
• Feature light-up

Staged Data Acquisition - Outlook
• Filtering and aggregation at client: Be kind to the client
• Pipeline to collect and process data: Make it easy

Staged Data Acquisition
• Feature Crew
• Outlook Team
• MS Office Team
• Microsoft Customers

Scale Validation

Staged Data Acquisition

Service
• Stage 1: In prod, no users
• Stage 2: Dogfood
• Stage 3: Some servers in prod
• Stage 4: Some more servers in prod
• Stage 5: World-wide prod

Validation labels on the diagram: Deployment Validation, Service Validation, Scale Validation, Real-time service quality

Product (client, on-prem server)
• Stage 1: Partial or whole product team
• Stage 2: Dogfood
• Stage 3: Technology Adoption Programs (TAP)
• Stage 4: Some clients in production
• Stage 5: All Customers
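
A staged rollout can be sketched as a walk through an ordered list of stages, gated by the data collected at each one. The stage names below come from the service list above. The promotion rule (advance when quality looks good, step back one stage when it does not) is an assumption made for this sketch.

```python
SERVICE_STAGES = [
    "In prod, no users",
    "Dogfood",
    "Some servers in prod",
    "Some more servers in prod",
    "World-wide prod",
]


def next_stage(current_index, quality_ok):
    """Advance one stage when the data looks good; otherwise step back one.

    The step-back rule is an assumption made for this sketch.
    """
    if quality_ok:
        return min(current_index + 1, len(SERVICE_STAGES) - 1)
    return max(current_index - 1, 0)


# Hypothetical sequence of quality signals, one per evaluation.
stage = 0
for signal in [True, True, False, True, True, True]:
    stage = next_stage(stage, signal)
print(SERVICE_STAGES[stage])
```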

Feedback

Good Data lets you ask Questions

Diagram labels: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable; Analysis measures

• Success and Failure Ratios
• Are we good enough
• Is Customer Engagement up
• Is time to task completion down
• Did we get enough user feedback

Production Data is Real Data

Power of Production Data: Huge Impact
• Real users
• Multiple environments
• End to end
• Scale & geo-diversity

Keep your eye on the target

The goal is not to get a bulls eye every time

The goal is to get the data and Learn

Taking Action

• Change the Software

• Change the Experience

• Change the expected outcome

A/B testing
• Controlled experimentation
• Usage data on different experiences

Combine into more complex scenarios
• How did user get to shopping cart checkout?
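
One common way to judge a controlled experiment is a two-proportion z-test on the conversion rates of the two experiences. The sketch below uses that test with invented checkout numbers. The talk does not say which statistical test was used, so the choice of test is an assumption.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rates.

    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: experience A converts 200 of 5000 users,
# experience B converts 260 of 5000 users.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(round(z, 2), round(p, 4))
```

A small p-value suggests the difference between the two experiences is unlikely to be noise. Whether the difference is large enough to act on remains a product decision.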

We then Re-Evaluate

Diagram labels: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable; Analysis measures

• Success and Failure Ratios
• Are we good enough
• Is Customer Engagement up
• Is time to task completion down
• Did we get enough user feedback
• What should we Change Next

Designing and Releasing for DDQ

Testing then

Chart labels: The world of possible testing; How much we tested at Microsoft; Typical Industry Level of Testing

Say NO to BUFT

Minimum Viable Quality

Chart axis: Increasing Test Investment

• Possible Testing: Waste, excessive automation and excessive testing that does not find any meaningful bugs.
• Minimum Viable Release Quality: MVQ for all users but still use a rolling release process. Fix final few critical bugs after release. Rich instrumentation identifies remaining critical to fix bugs in the shipped code.
• Limited Release: MVQ for sub-set of users. Beta Users, Enthusiasts, Flighting.
• Under Tested: frequent rollbacks, limited user engagement, strong negative customer feedback, bad press.

MVQ focuses on minimizing up front testing. Rich telemetry from production shifts testing and validation into production.

Speed is your friend because…

Code Churn Example 1
• Six week coding milestone
• Code churn is cumulative
• Maximum point of instability is at end of milestone

Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3)
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration

Sim-ship increases risk

Code Churn Example 2 (Continuous Deployment)

Rapid release cadence (weekly or daily)

Max Risk is Production

Diagram labels: Layer 1; Layer 2; Layer 3; Layer N

• Risk per release decreases because of more incremental change

• You still must be careful of Risk within Production but…

• Total risk over time can be less with incremental change

User Segmentation

Organizing Users by profile and Risk Tolerance

User Segmentation Approaches

• Profile Based
  o Usage behaviors
  o new vs. power users
  o Browser type
  o Connection Type
  o Device and Device OS
• Opted in
  o Users Segment themselves
  o Opting in indicates risk tolerance

Balancing Speed and Risk with Rings

Rings, from highest to lowest risk tolerance:
• Ring 0: Buddy Build (Risk Tolerance is highest)
• Ring 1: My Team
• Ring 2: Company
• Ring 3: External Beta Users & NDA
• Ring 4: Everyone (No desire for risk)

Red Line demarks disclosure risk and possible loss of patent rights
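
The ring model can be sketched as a mapping from user profile to ring, plus a rule for which rings a build has reached. The profile fields and function names below are hypothetical. Numbering the external beta and NDA ring as Ring 3 is also an assumption, made so that each ring has its own number.

```python
RINGS = {
    0: "Buddy Build",
    1: "My Team",
    2: "Company",
    3: "External Beta Users & NDA",
    4: "Everyone",
}


def ring_for_user(user):
    """Pick the innermost (most risk-tolerant) ring a user qualifies for."""
    if user.get("buddy_tester"):
        return 0
    if user.get("on_feature_team"):
        return 1
    if user.get("employee"):
        return 2
    if user.get("opted_in_beta"):
        return 3
    return 4


def sees_build(user, released_through_ring):
    """A build released through ring N reaches users in rings 0..N."""
    return ring_for_user(user) <= released_through_ring


# A build that has reached Ring 2 is visible to employees but not beta users.
print(sees_build({"employee": True}, released_through_ring=2))
print(sees_build({"opted_in_beta": True}, released_through_ring=2))
```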

The AutoPilot Watchdog Model

Servers have 3 states they can be in
• Healthy
• Failure mode
• Probation

Watchdogs report on server health

Repair Service has 3 actions
• Kill and restart a failing service
• Re-boot the server
• Re-image the server
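
The three states and three repair actions can be read as a small state machine. The sketch below is one possible reading, not AutoPilot's actual implementation. It assumes that repairs escalate in the order listed, that a repaired server enters probation, and that one good watchdog report ends probation.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    FAILURE = "failure"
    PROBATION = "probation"


# Escalation order is an assumption made for this sketch.
REPAIRS = ["restart service", "reboot server", "re-image server"]


class Server:
    def __init__(self, name):
        self.name = name
        self.state = State.HEALTHY
        self.repair_level = 0  # index of the next repair to try

    def watchdog_report(self, ok):
        """Record one watchdog report on this server's health."""
        if not ok:
            self.state = State.FAILURE
        elif self.state is State.PROBATION:
            # A good report while on probation returns the server to service.
            self.state = State.HEALTHY
            self.repair_level = 0

    def repair(self):
        """Repair service step: act only on servers in failure mode."""
        if self.state is not State.FAILURE:
            return None
        action = REPAIRS[min(self.repair_level, len(REPAIRS) - 1)]
        self.repair_level += 1
        self.state = State.PROBATION
        return action


server = Server("web-01")
server.watchdog_report(ok=False)
first = server.repair()        # lightest repair first
server.watchdog_report(ok=False)
second = server.repair()       # failed again on probation, so escalate
server.watchdog_report(ok=True)
print(first, second, server.state)
```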

Watchdogs are not just for Services
• Watchdogs built into apps
• 3 states for an App
  o Kill and restart app sending report
  o Re-boot the device and re-launch
  o Fail back to LKG (Last Known Good)
• Devices are trickier
  o Build logic into the device so that if it loses connectivity it can self correct.
  o Instead of factory settings why not auto fail back to LKG

Generic Service Stack

Production Traffic follows the Default Path through the stack:
• Service UX Front Door
• Service Auth/Identity
• Layer A vCurrent
• Layer B vCurrent
• Service Layer C (Persistent Data Store)

Layer roles:
• Front door servers for logging and access management
• UX rendering layers
• Identity or authentication layers
• Persistent data layers

Runtime Flags Example 1: Side-by-Side Deployments

Runtime Flags
• Flags direct traffic through the stack
• Used to test vNext before full release

Diagram labels: Production Traffic; Test or Forked Traffic; Service UX Front Door; Layer A vCurrent; Layer B vCurrent; Layer B vNext; Default Path; Runtime Flags
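
Routing by runtime flag can be sketched as a default path that individual requests may override layer by layer. The flag format and the function name below are hypothetical. The layer and version names follow the diagram labels on this slide.

```python
DEFAULT_PATH = {"layer_a": "vCurrent", "layer_b": "vCurrent"}


def resolve_route(request_flags):
    """Return the version of each layer a request should pass through.

    Flags naming unknown layers are ignored, so a request cannot
    invent a layer that the stack does not have.
    """
    route = dict(DEFAULT_PATH)
    for layer, version in request_flags.items():
        if layer in route:
            route[layer] = version
    return route


# Production traffic carries no flags and stays on the default path.
print(resolve_route({}))

# Test or forked traffic is flagged to exercise Layer B vNext.
print(resolve_route({"layer_b": "vNext"}))
```

Because the flag travels with the request, vNext can run side by side with vCurrent in production without moving the default path.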

Runtime Flags Example 2: N Test Environments

Diagram labels: Production Traffic; Test Case Traffic; Checkin Tests; Service UX Front Door; Layer A vCurrent; Layer A DevBox; Layer B vCurrent; Layer B Test Cluster; Default Path; Runtime Flags

Apps as a Service: Facebook

How Facebook secretly redesigned its iPhone app with your help

…a system for creating alternate versions… within the native app. The team could then turn on certain new features for a subset of its users, directly,

…a system of "different types of Legos... and see the results on the server in real time."

From article on The Verge by Dieter Bohn September 18, 2013

That Was a lot of Content

• Big Data + Agile are Driving Change
• Data Driven Quality Framework
• Minimum Viable Quality
• Designing for DDQ and Mitigating Risk
• Taking Action

Chart: “Big Data” Search Trends

Thank You
