DATA CONFUSION How to confuse yourself and others with Data Analysis

Page 1: DATA CONFUSION

DATA CONFUSION

How to confuse yourself and others with Data Analysis

Page 2: DATA CONFUSION

AGENDA FOR TODAY’S TALK

• Good Graphs – Bad Graphs

• The Law of Averages

• PTBD Analysis

• Enumerative & Analytical Problems

• PARC Analysis

• Wrong Methods of Analysis

Page 3: DATA CONFUSION

“There are three kinds of lies: lies, damned lies and statistics”

Attributed to Benjamin Disraeli by Mark Twain

Page 4: DATA CONFUSION

GOOD GRAPHS AND BAD GRAPHS

Page 5: DATA CONFUSION

DATA RELEVANCE

• Graphs are only as good as the data they display

• No amount of creativity can produce good graphs from dubious data

Page 6: DATA CONFUSION

DATA CONTENT

• Don’t produce graphs from very small amounts of data

• The human brain can grasp 1, 2 or 3 numbers without a graph

Page 7: DATA CONFUSION

RULES FOR PRODUCING GOOD GRAPHS

• KEEP IT SIMPLE AND STUPID – Jesse Ventura

• Tell the truth – don’t distort the data

Page 8: DATA CONFUSION

GOOD GRAPHS

• Portray information without distortion

• Contain no distracting elements

– No false third dimensions, irrelevant decoration, or colour (chartjunk)

• Use an appropriate scale

• Label axes and tick marks properly, including measurement units

• Have a descriptive title and/or caption and legend

• Have a low ink-to-information ratio

Page 9: DATA CONFUSION

[Figure: four versions of the line chart "Temperature (degC) of Air and Subject during one day" (x-axis: Time of Day, 6 am to 6 am; y-axis: Temperature (degC); series: Air and Subject). The versions differ in y-axis scale (15 to 40, 0 to 40 and 0 to 100) and in whether the title and axis labels are shown. Panel labels: BAD GRAPH, GOOD GRAPH, BAD GRAPH, EVEN BETTER GRAPH.]

Page 10: DATA CONFUSION

[Figure: three charts of groups A, B, C, D and E: a chart on a 0 to 18 axis, "Boxplot of A, B, C, D, E" (Data axis 12 to 17) and "Dotplot of A, B, C, D, E" (Data axis 11.5 to 17.0). Panel labels: BAD GRAPH, GOOD GRAPH, GOOD GRAPH.]

Project: more chewchat data.MPJ; Worksheet: Worksheet 1; 12/04/2006; Graham Errington

Page 11: DATA CONFUSION

GRAPHS THAT CONFUSE

[Figure: three charts titled MONTHLY REJECTS (x-axis: MONTH, Jan to Dec; y-axis: No. REJECTS; series: 2001 to 2005), drawn with a 0 to 9000 axis, a 0% to 100% axis and a 0 to 35000 axis.]

Page 12: DATA CONFUSION

CHART JUNK

[Figure: the MONTHLY REJECTS data (No. REJECTS by MONTH, Jan to Dec, for 2001 to 2005) redrawn in several chart styles as examples of chart junk, including versions with a 0 to 8000 axis, a 0 to 35000 axis, a 0% to 100% axis and a printed data label on every value.]

Page 13: DATA CONFUSION

GRAPHS THAT TELL A STORY

[Figure: "Time Series Plot of No. REJECTS" (x-axis: MONTH and YEAR, 2001 to 2005; y-axis: No. REJECTS, 3000 to 8000).]

Project: Untitled; Worksheet: Worksheet 3; 04/04/2006; Graham Errington

[Figure: "I-MR Chart of No. REJECTS by YEAR" (x-axis: MONTH, grouped by year 2001 to 2005). Individual Value panel: X-bar = 5926, UCL = 8406, LCL = 3445. Moving Range panel: MR-bar = 933, UCL = 3048, LCL = 0. Some points carry the flags 1 or 2.]

Project: Data for ChewChat 13 Apr 2006.MPJ; Worksheet: Worksheet 3; 04/04/2006; Graham Errington
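
The limits printed on the I-MR chart can be reproduced from the two averages it reports. The following is a minimal sketch, assuming the usual constants for moving ranges of span 2 (2.66 for the individuals limits and 3.267 for the moving-range upper limit). The function name `imr_limits` is illustrative and does not come from the talk.

```python
def imr_limits(x_bar, mr_bar):
    """Return control limits for an individuals and moving-range (I-MR) chart.

    Assumes moving ranges of span 2, for which the standard constants are
    2.66 (3 / d2 with d2 = 1.128) and 3.267 (D4).
    """
    return {
        "I_UCL": x_bar + 2.66 * mr_bar,
        "I_LCL": x_bar - 2.66 * mr_bar,
        "MR_UCL": 3.267 * mr_bar,
        "MR_LCL": 0.0,
    }


# Averages read off the chart: X-bar = 5926, MR-bar = 933
limits = imr_limits(5926, 933)
for name, value in limits.items():
    print(f"{name}: {value:.0f}")
```

The results land within a few units of the chart's UCL = 8406, LCL = 3445 and moving-range UCL = 3048. Small differences are to be expected because the chart shows rounded averages.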

Page 14: DATA CONFUSION

HISTOGRAMS

• No meaningless gaps

• Reasonable Choice of bins

• Easy to choose or adjust bins

• Good aspect ratio

• Meaningful labels on axes

• Appropriate labels on bin tick marks

[Figure: two histograms. "Histogram" has axes Bin and Frequency (0 to 20), with bin labels printed as long decimal values (for example 4757.8...). "Histogram of C20" has axes Data (4000 to 8000) and Frequency (0 to 14).]

Project: DATA FOR CHEWCHAT 13 APR 2006.MPJ; Worksheet: Worksheet 1; 07/04/2006; Graham Errington

Page 15: DATA CONFUSION

TRENDING RANDOM VARIATION

“Upward trend”

“Downturn”

“Rebound”

“Setback”

“Turnaround”

“Downward trend”

Page 16: DATA CONFUSION

THE LAW OF AVERAGES

“If I sit in a freezer and plunge my head into a pan of boiling chip fat. . . . .

on average, I’m quite comfortable.”

Page 17: DATA CONFUSION

SHEWHART’S RULES FOR PRESENTATION OF DATA

• Rule One

– Data should always be presented in a way that preserves the evidence in the data

• Rule Two

– When an average, standard deviation or histogram is used to summarize data, the user should not be misled into taking action they would not take if the data were presented in a time series

Page 18: DATA CONFUSION

USING THE WRONG METHODS

Descriptive Statistics: A, B, C, D

Variable N Mean StDev CoefVar Minimum Maximum

A 20 11.950 0.102 0.85 11.83 12.08

B 20 11.950 0.100 0.84 11.85 12.25

C 20 11.950 0.102 0.86 11.75 12.15

D 20 11.950 0.100 0.84 11.81 12.14

Process: A B C D

1 11.85 11.85 11.75 12.14

2 11.83 11.86 11.95 12.01

3 11.87 11.87 11.8 11.88

4 11.84 11.87 11.94 12.07

5 11.85 11.88 11.95 11.95

6 11.86 11.89 12 11.87

7 11.85 11.89 12.05 12.06

8 11.85 11.9 11.85 11.94

9 11.84 11.92 11.94 11.84

10 11.86 11.91 11.85 12.05

11 12.05 11.93 12.05 11.93

12 12.06 11.93 11.85 11.83

13 12.03 11.95 12.05 12.04

14 12.02 11.97 11.95 11.92

15 12.03 11.96 11.95 11.82

16 12.04 11.99 11.95 12.03

17 12.06 12 11.85 11.91

18 12.06 12 12.1 11.81

19 12.04 12.16 12 12.01

20 12.08 12.25 12.15 11.81

Page 19: DATA CONFUSION

NO SIGNIFICANT DIFFERENCE HERE!

One-way ANOVA: A, B, C, D

Source  DF      SS      MS     F      P
Factor   3  0.0000  0.0000  0.00  1.000
Error   76  0.7746  0.0102
Total   79  0.7746

S = 0.1010   R-Sq = 0.00%   R-Sq(adj) = 0.00%

Individual 95% CIs For Mean Based on Pooled StDev

Level   N    Mean  StDev  --------+---------+---------+---------+-
A      20  11.950  0.102  (-----------------*-----------------)
B      20  11.950  0.100  (-----------------*-----------------)
C      20  11.950  0.102  (-----------------*-----------------)
D      20  11.950  0.100  (-----------------*-----------------)
                          --------+---------+---------+---------+-
                             11.925    11.950    11.975    12.000

Pooled StDev = 0.101

Page 20: DATA CONFUSION

NO DIFFERENCE?!?

[Figure: four-panel time series plot of processes A, B, C and D (x-axis: Sample, 2 to 20; y-axis: 11.8 to 12.2).]

Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Worksheet 1; 05/04/2006; Graham Errington

FOUR PROCESSES WITH SAME MEAN AND SD

Mean = 11.95, SD = 0.10
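
The point can be checked numerically from the "Process: A B C D" table. The sketch below computes the mean and standard deviation, which ignore run order, alongside the mean moving range, which depends on it. The data are transcribed from the table; the helper name `mean_moving_range` is illustrative, and using the moving range as the order-sensitive measure is a choice made for this example, not a method stated in the talk.

```python
from statistics import mean, stdev

# Data transcribed from the "Process: A B C D" table (20 samples each, in run order)
processes = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}


def mean_moving_range(values):
    """Average absolute difference between consecutive observations."""
    return mean(abs(b - a) for a, b in zip(values, values[1:]))


summary = {
    name: (mean(v), stdev(v), mean_moving_range(v))
    for name, v in processes.items()
}

for name, (m, s, mr) in summary.items():
    print(f"{name}: mean={m:.2f}  sd={s:.2f}  mean moving range={mr:.3f}")
```

All four processes report the same mean and SD to two decimals, yet the mean moving range of A and B is several times smaller than that of C and D. That gap only appears once the order of the observations is taken into account, which is what plotting the data in sequence reveals.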

Page 21: DATA CONFUSION

ALWAYS CARRY OUT PTBD ANALYSIS

PLOT THE B….. DOTS!

Page 22: DATA CONFUSION

TYPES OF STATISTICAL STUDIES

• Descriptive

• Enumerative

• Analytic

Page 23: DATA CONFUSION

DESCRIPTIVE STUDY

• Count all fish in barrel

• Count number of goldfish

• Proportion of goldfish applies to the fish population in this barrel and no other barrels of fish

Page 24: DATA CONFUSION

ENUMERATIVE STUDY

• Take a sample of fish from the barrel, and count the number of goldfish in the sample

• Point estimate of the proportion of goldfish in the barrel population

• Many statistical procedures do this

• Cannot make any inference about any other barrels of fish

Page 25: DATA CONFUSION

ANALYTICAL STUDY

• Will we get the same proportion of goldfish in the future as we got in the past?

• An analytical study allows prediction within limits

Fish Packing Process over Time

Page 26: DATA CONFUSION

ANALYTICAL STUDY

• Proportion of goldfish is stable over time

• Fish packing process is predictable within limits

• We can expect, on average, 4 goldfish per barrel, but as many as 10 and as few as 0 in any single barrel

[Figure: "C Chart of No goldfish per Barrel" (x-axis: Week No., 1 to 19; y-axis: Sample Count, 0 to 10). Centre line C-bar = 4, UCL = 10, LCL = 0.]

Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Fish in Barrel; 05/04/2006; Graham Errington
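
The "as many as 10 and as few as 0" range follows from the standard three-sigma limits for a c chart. A minimal sketch, assuming the goldfish counts behave like Poisson counts (the usual c chart assumption); the function name `c_chart_limits` is illustrative.

```python
import math


def c_chart_limits(c_bar):
    """Three-sigma limits for a c chart, assuming Poisson-distributed counts."""
    spread = 3 * math.sqrt(c_bar)
    lcl = max(0.0, c_bar - spread)  # a count cannot be negative
    ucl = c_bar + spread
    return lcl, c_bar, ucl


lcl, centre, ucl = c_chart_limits(4)
print(f"LCL={lcl:g}  centre={centre:g}  UCL={ucl:g}")
# prints: LCL=0  centre=4  UCL=10
```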

Page 27: DATA CONFUSION

ENUMERATIVE vs ANALYTICAL METHODS

• Enumerative methods

– seek to provide numeric summaries, confidence intervals, etc.

– use significance tests, ANOVA, descriptive stats, etc.

– assume single, stable population

• Analytical methods

– seek to understand the system under study

– use primarily graphical tools such as run charts, control charts, histograms, box plots, etc.

– in the real world, most problems are analytical

Page 28: DATA CONFUSION

“Analysis of variance, t-tests, confidence intervals, and other statistical techniques taught in books,….., are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production.”

W.E. Deming, Out of the Crisis

Traditional statistical methods have their place, but are widely abused in the real world. When this is the case, statistics do more to cloud the issue than to enlighten.

Page 29: DATA CONFUSION

PARC ANALYSIS

Practical Accumulated Records Compilation

Passive Analysis (by) Regression Correlations

Planning After Research Completed

Profound Analysis Relying (on) Computers

note inverse relationship with

Continuous Recording (of) Administrative Procedures

Constant Repetition (of) Anecdotal Perceptions

Page 30: DATA CONFUSION

PLANNING A PROCESS IMPROVEMENT STUDY

• Why collect the data?

• What statistical methods for analysis?

• What data will be collected?

• How much data do we need?

• How will the data be measured?

• How good is the measurement system?

• When and where will data be collected?

• Who will collect the data?

• Remember:

Page 31: DATA CONFUSION

GARBAGE IN – GARBAGE OUT

Page 32: DATA CONFUSION

WHAT’S SIGNIFICANT?

Two-sample T for C1 vs C2

N Mean StDev SE Mean

A 5 13.652 0.487 0.22

B 5 14.369 0.646 0.29

Difference = mu (C1) - mu (C2)

Estimate for difference: -0.716615

95% CI for difference: (-1.551531, 0.118301)

T-Test of difference = 0 (vs not =): T-Value = -1.98 P-Value = 0.083 DF = 8

Both use Pooled StDev = 0.5725

Two-sample T for C3 vs C4

N Mean StDev SE Mean

A 200 13.510 0.501 0.035

B 200 13.667 0.492 0.035

Difference = mu (C3) - mu (C4)

Estimate for difference: -0.157292

95% CI for difference: (-0.254935, -0.059649)

T-Test of difference = 0 (vs not =): T-Value = -3.17 P-Value = 0.002 DF = 398

Both use Pooled StDev = 0.4967

Mean A = 13.7, Mean B = 14.4

Not significant?

Mean A = 13.5, Mean B = 13.7

Significant?
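
The two outputs differ mainly in sample size, and the effect shows up in the standard error. The sketch below recomputes the pooled t statistic from the summary statistics printed above (N, Mean, StDev). The function name `pooled_t` is illustrative, and no p-value is computed because that needs a t distribution, which the Python standard library does not provide.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled two-sample t statistic from summary statistics.

    Returns (t, standard error of the difference, degrees of freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, std_err, df


small = pooled_t(5, 13.652, 0.487, 5, 14.369, 0.646)
large = pooled_t(200, 13.510, 0.501, 200, 13.667, 0.492)

for label, (t, se, df) in (("n=5", small), ("n=200", large)):
    print(f"{label}: t={t:.2f}  standard error={se:.3f}  DF={df}")
```

The difference in means is larger in the n=5 comparison, but its standard error is several times larger too, so its t value is smaller. With 200 per group the standard error shrinks enough for a much smaller difference to be flagged as statistically significant.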

Page 33: DATA CONFUSION

WHAT SHOULD I DO WITH OUTLIERS?

• Data point far away from the rest of the data

• Don’t remove outliers to make data “look good”

• Do you know why it is different?

• If you do, remove it. If you don’t, leave it in

• Could have a big impact on the analysis

• Re-run analysis without outlier, and compare results
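
The last bullet can be sketched in a few lines: run the same summary with and without the suspect point and compare. The measurements below are hypothetical and chosen only for illustration; they are not from the talk.

```python
from statistics import mean, stdev

# Hypothetical measurements; the last value is the suspected outlier
data = [10.1, 9.9, 10.0, 10.2, 9.8, 14.0]
without = data[:-1]

for label, values in (("with outlier", data), ("without outlier", without)):
    print(f"{label}: n={len(values)}  mean={mean(values):.2f}  sd={stdev(values):.2f}")
```

If the two sets of results lead to different conclusions, the outlier is driving the analysis and its cause needs to be understood before anything is reported.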

Page 34: DATA CONFUSION

“REGRESSION” WITH EXCEL

• Usually means drawing an X-Y plot, fitting a straight line and coming up with an R² value.

• As long as R² is high, everything’s hunky-dory.

WRONG!

Page 35: DATA CONFUSION

“REGRESSION” WITH EXCEL

[Figure: Excel scatter plot "Defects vs Cure Time" (x-axis: Cure Time s, 20 to 50; y-axis: No. of Defects, -2 to 6) with fitted line y = 0.1913x - 5.5192, R² = 0.5079.]

Relationship is clearly not linear, and should not be presented as such

Page 36: DATA CONFUSION

“REGRESSION” WITH EXCEL

• Regression model checking – in Excel?

• Residual plots:

– Normally distributed

– Random pattern when plotted vs fitted values

[Figure: three residual plots, labelled OK, Variance not homogeneous, and Model incorrect.]
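
A residual check does not need special software. The sketch below fits a straight line by ordinary least squares to deliberately curved data and prints the residuals in order of x. The data (y = x squared) are hypothetical and chosen only to illustrate the "model incorrect" pattern; the function name `fit_line` is illustrative.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: y = x squared, deliberately not linear
xs = list(range(11))
ys = [x**2 for x in xs]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

ss_res = sum(r**2 for r in residuals)
ss_tot = sum((y - mean(ys)) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R-squared = {r_squared:.2f}")
print("residuals:", [round(r, 1) for r in residuals])
```

R² comes out above 0.9 even though the straight line is the wrong model. The residuals give it away: they are positive at both ends and negative in the middle instead of scattering randomly around zero.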

Page 37: DATA CONFUSION

PITFALLS OF REGRESSION ANALYSIS

• Non-Linear Relationships

• Influential Points

• Extrapolating

• Lurking Variables

• Summary Data

• Assuming Causation

Page 38: DATA CONFUSION

• THAT’S (WITH REASONABLE PROBABILITY) THE END FOLKS!

And remember,

• With statistics, you never have to say you’re certain!

Page 39: DATA CONFUSION

• THANK YOU FOR YOUR ATTENTION

• ARE THERE ANY QUESTIONS?

• GOOD LUCK!!