Diversity in Datasets: (d)econstructing Descriptive Statistics and Data Visualization
Douglas James Joubert, National Institutes of Health Library

Descriptive Statistics and Data Visualization

Outlines the basics of descriptive statistics

Page 1: Descriptive Statistics and Data Visualization

Diversity in Datasets: (d)econstructing Descriptive Statistics and Data Visualization

Douglas James Joubert
National Institutes of Health Library

Page 2: Descriptive Statistics and Data Visualization

Outline

Types of Scale
Levels of Measurement
Descriptive vs. Inferential Statistics
Univariate Analysis
Graphical Methods for Displaying Data

Page 3: Descriptive Statistics and Data Visualization

Before you Survey

Consult with a Statistician

Vital to your success

Great way to collaborate

Page 4: Descriptive Statistics and Data Visualization

Analysis Always Follows Design

Johnson (2005)

Question

Hypothesis

Experimental Design

Samples

Data

Analysis

Page 5: Descriptive Statistics and Data Visualization

Descriptive Statistics

Location: Mean, Median, Mode
Spread (Dispersion): SD, Variance, COV
Shape of the Distribution: Skewness (+ or -), Kurtosis
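The three families of measures above can be computed with Python's standard library. This is a minimal sketch, not the presenter's method: the data are hypothetical, the function name describe is made up for illustration, and skewness and kurtosis use the simple moment-based formulas (other software may apply small-sample corrections and report slightly different values).

```python
import statistics

def describe(values):
    """Return location, spread, and shape measures for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1 denominator)
    # Central moments (n denominator), used only for the shape measures.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "variance": statistics.variance(values),
        "cov": sd / mean,                    # coefficient of variation
        "skewness": m3 / m2 ** 1.5,          # > 0: longer right tail
        "kurtosis": m4 / m2 ** 2 - 3,        # excess kurtosis (normal = 0)
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data
summary = describe(scores)
for name, value in summary.items():
    print(f"{name:>9}: {value:.4f}")
```

A positive skewness here reflects the two larger values (7 and 9) pulling the right tail out.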

Page 6: Descriptive Statistics and Data Visualization

Levels of Measurement

The questions you ask are just as important as what is being measured
  Consult, confer, and pick apart your hypothesis
Results are only as good as your poorest measurement
  Your measurement will never provide the absolute truth
Try to control as much as possible to reduce error
  Random error: due to chance, in either direction
  Systematic error: due to bias, in one direction
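The difference between the two kinds of error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the size of the bias, and the noise level are all hypothetical, and random error is modeled as normally distributed noise.

```python
import random
import statistics

random.seed(0)                # fixed seed so the run is repeatable
TRUE_VALUE = 70.0             # hypothetical true weight (kg)
BIAS = 2.0                    # hypothetical scale that reads 2 kg heavy

# Random error only: noise scatters the readings in both directions.
random_only = [TRUE_VALUE + random.gauss(0, 1) for _ in range(10_000)]
# Systematic error added: every reading is pushed the same way.
with_bias = [reading + BIAS for reading in random_only]

mean_random = statistics.mean(random_only)
mean_bias = statistics.mean(with_bias)
print(f"random error only:     mean = {mean_random:.2f}")
print(f"with systematic error: mean = {mean_bias:.2f}")
```

Averaging many readings shrinks the effect of random error, but the mean of the biased readings stays shifted by the bias no matter how many readings are taken.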

Page 7: Descriptive Statistics and Data Visualization

Reducing Measurement Error

Triangulate: use different measures (X1, X2) for the same construct

Page 8: Descriptive Statistics and Data Visualization

Types of Scale

Nominal or Categorical
  Mutually exclusive groups: gender, sick vs. healthy, remote user vs. library user
  Used for identification purposes only
  Cannot be ranked from smallest to largest
Ordinal
  Mutually exclusive groups that are also ordered in a meaningful manner
  Distance between categories is unknown: you cannot say that a person with a job satisfaction of 2 is twice as satisfied as a person rated as a 1

Page 9: Descriptive Statistics and Data Visualization

Types of Scale

Interval
  Ordered groups with equal intervals between any two adjacent classes
  No absolute zero, so you cannot compute ratios; for example, temperature in °C or °F
Ratio
  Interval scale with a true absolute zero; for example, weight
  You can tell how many times larger or smaller one value is compared with another

Page 10: Descriptive Statistics and Data Visualization

Hierarchy of Measurement

Ratio: absolute zero
Interval: distance is meaningful
Ordinal: characteristics can be ordered
Nominal: classification is arbitrary

Trochim (2001)

Page 11: Descriptive Statistics and Data Visualization

Descriptive vs. Inferential Statistics

Descriptive (summary) statistics describe or characterize data in such a way that none of the original information is lost or distorted

Inferential statistics allow one to draw conclusions about a population based on data obtained from a sample

Munro (2002)

Figure: samples S1 to S6 drawn from a population whose values are unknown (?)

Page 12: Descriptive Statistics and Data Visualization

Univariate Descriptive Analysis

Allows one to examine each variable separately to check for data inconsistencies and to assess the variability of each variable

Also allows one to check statistical assumptions about the shape of the distribution before moving on to more complex analysis

Univariate descriptive statistics can also be used to determine central tendency, variability, skewness, and kurtosis

Page 13: Descriptive Statistics and Data Visualization

Graphical Methods for Displaying Data

Frequency Distributions
Histograms
Plots
Pareto Charts
Boxplots
Error Bar Charts

Page 14: Descriptive Statistics and Data Visualization

Frequency distributions

Frequency distributions are a useful tool for categorizing data into meaningful groups
They organize data in tabular form using classes and frequencies
Two main types:
  Categorical: qualitative data such as gender, treatment group or not, religious affiliation
  Ungrouped or grouped: quantitative data

Page 15: Descriptive Statistics and Data Visualization

Categorical Frequency distributions

AB B A O

O O B B

A A AB AB

A B O A

Class   Frequency (f)
A       5
B       4
O       4
AB      3
Total   16
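The tally above can be reproduced in a few lines. This is a minimal sketch using Python's collections.Counter on the 16 observations listed on the slide; the variable names are illustrative.

```python
from collections import Counter

# The 16 observations from the slide, row by row.
observations = [
    "AB", "B", "A", "O",
    "O", "O", "B", "B",
    "A", "A", "AB", "AB",
    "A", "B", "O", "A",
]

counts = Counter(observations)
print(f"{'Class':<6}{'f':>3}")
for cls, f in counts.most_common():      # classes ordered by frequency
    print(f"{cls:<6}{f:>3}")
print(f"{'Total':<6}{sum(counts.values()):>3}")
```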

Page 16: Descriptive Statistics and Data Visualization

Ungrouped Frequency distributions

Birth weight data (oz)

32 58 64 64

67 88 88 91

93 94 94 89

98 98 100 101

103 103 155 161

Page 17: Descriptive Statistics and Data Visualization

Ungrouped Frequency distributions

Birth weight   Count (Frequency f)
32             1
58             1
64             2
67             1
88             2
91             1
93             1
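An ungrouped frequency distribution simply counts how often each distinct value occurs. The sketch below applies collections.Counter to the 20 birth weights listed on the previous slide; note that the raw listing includes the value 89, so it appears as its own row in the output.

```python
from collections import Counter

# Birth weights (oz) as listed on the data slide.
birth_weights_oz = [
    32, 58, 64, 64,
    67, 88, 88, 91,
    93, 94, 94, 89,
    98, 98, 100, 101,
    103, 103, 155, 161,
]

# One row per distinct value, sorted from smallest to largest.
ungrouped = sorted(Counter(birth_weights_oz).items())
print(f"{'Birth weight':<14}{'f':>3}")
for weight, f in ungrouped:
    print(f"{weight:<14}{f:>3}")
```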

Page 18: Descriptive Statistics and Data Visualization

Grouped Frequency Distribution

A grouped frequency distribution is obtained by constructing classes (intervals) for the data

If the difference between the minimum and maximum values exceeds 15, divide the data into classes

Use a minimum of 5 classes and a maximum of 20

A histogram is a graphical representation of a frequency distribution
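The rules of thumb above can be sketched as follows, using the birth weight data from the earlier slide. The class width of 20 oz is an assumption chosen for illustration, and the helper name group_into_classes is hypothetical; each class includes its lower limit and excludes its upper limit.

```python
birth_weights_oz = [
    32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
    94, 89, 98, 98, 100, 101, 103, 103, 155, 161,
]
CLASS_WIDTH = 20   # assumed width, not taken from the slides

def group_into_classes(values, width):
    """Count values in equal-width classes [lower, upper)."""
    start = (min(values) // width) * width
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for value in values:
        counts[(value - start) // width] += 1
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]

data_range = max(birth_weights_oz) - min(birth_weights_oz)
print(f"range = {data_range} (grouping suggested if > 15)")
classes = group_into_classes(birth_weights_oz, CLASS_WIDTH)
for lower, upper, count in classes:
    print(f"{lower:>3} to <{upper:<4}{count:>3}")
```

With this width the data fall into 8 classes, which sits inside the suggested 5 to 20.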

Page 19: Descriptive Statistics and Data Visualization

Grouped Frequency Distribution

Typically, grouped frequency distributions will contain:
  Frequency: the number of values within each category
  Relative frequency: the percentage of values within each category, based on the total number of cases
  Valid percent: the percentage of cases in each category, based on non-missing scores
  Cumulative frequency: the sum of the frequencies for all values at or below the given value
  Cumulative relative frequency: the sum of the relative frequencies for all values at or below the given value

Page 20: Descriptive Statistics and Data Visualization

Grouped Frequency Distribution of CA patients

Age       Frequency   rf*      cf    crf
0 – 10    2           0.0070   2     0.0070
10 – 20   71          0.2474   73    0.2544
20 – 30   59          0.2056   132   0.4599
30 – 40   70          0.2439   202   0.7038
40 – 50   43          0.1498   245   0.8537
More      42          0.1463   287   1.0000
Total     287         1.0000

* rf = frequency / total. In Excel: =E2/$E$8, where the $ signs force an absolute reference to the total (multiply by 100 for a percentage)
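The rf, cf, and crf columns all follow from the class frequencies. A minimal sketch, using the frequencies from the table above and Python's itertools.accumulate for the running totals; computing crf as cf divided by the total (rather than summing rounded rf values) avoids accumulating rounding error.

```python
from itertools import accumulate

age_classes = ["0-10", "10-20", "20-30", "30-40", "40-50", "More"]
frequencies = [2, 71, 59, 70, 43, 42]      # from the table

total = sum(frequencies)
rf = [f / total for f in frequencies]      # relative frequency
cf = list(accumulate(frequencies))         # cumulative frequency
crf = [c / total for c in cf]              # cumulative relative frequency

print(f"{'Age':<8}{'f':>5}{'rf':>9}{'cf':>6}{'crf':>9}")
for label, f, r, c, cr in zip(age_classes, frequencies, rf, cf, crf):
    print(f"{label:<8}{f:>5}{r:>9.4f}{c:>6}{cr:>9.4f}")
print(f"{'Total':<8}{total:>5}{sum(rf):>9.4f}")
```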

Page 21: Descriptive Statistics and Data Visualization

Table Tips

Use tables to highlight major facts
Keep it simple: tables are usually intended to demystify your data, not make it more difficult to understand
If you are using a software program to create class intervals, make sure the default works with your data
Think of your audience: how can I convey my message without losing important data?

Page 22: Descriptive Statistics and Data Visualization

Table Tips

The clustering that best describes the data should be the ultimate guide

Too few or too many class intervals will obscure important information about your data

Tables used to analyze data are rarely published

Page 23: Descriptive Statistics and Data Visualization

Charts

An effective way to give the reader a snapshot of the differences and patterns in a set of data
The primary disadvantage of charts is that you lose the details
Things to consider when constructing charts:
  Do my data represent a single moment in time (cross-sectional), or do they occur over time (time series)?
  Do I have a qualitative or a quantitative variable?
  If my variable is quantitative, is it discrete or continuous?

Munro (2002)

Page 24: Descriptive Statistics and Data Visualization

Bar Charts

For nominal or ordinal data, use simple bar charts
In simple bar charts, there are spaces between categories
Cluster bar charts can be used to represent univariate distributions
Cluster bar charts can also be stacked

Page 25: Descriptive Statistics and Data Visualization

Simple Bar Chart (nominal data)

Figure: simple bar chart by Location; categories: Cytoplasm, Plasma Membrane, Extracellular Space, Nucleus

Page 26: Descriptive Statistics and Data Visualization

Stacked Bar Chart

You are really just stacking two or more columns into a single new column

Compares the percentage that each group contributes to the total across categories

Use 100% stacked columns so that you can compare the percentages in each group
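Preparing data for a 100% stacked chart means rescaling each column so that its segments sum to 100. The sketch below shows only that rescaling step; the category names and counts are hypothetical, and the series labels X1 to X3 are borrowed from the chart legend.

```python
# Hypothetical counts for three series in each category.
counts = {
    "Category 1": {"X1": 10, "X2": 30, "X3": 60},
    "Category 2": {"X1": 5, "X2": 5, "X3": 10},
}

def to_percent(series_counts):
    """Rescale one column so its segments sum to 100%."""
    total = sum(series_counts.values())
    return {name: 100 * value / total for name, value in series_counts.items()}

stacked = {category: to_percent(series) for category, series in counts.items()}
for category, shares in stacked.items():
    parts = ", ".join(f"{name} {share:.0f}%" for name, share in shares.items())
    print(f"{category}: {parts}")
```

Because the two categories have different totals (100 and 20), only the rescaled values make the group shares directly comparable.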

Page 27: Descriptive Statistics and Data Visualization

Stacked Bar Chart

Figure: 100% stacked bar chart; categories 1 to 9 on the x-axis, 0% to 100% on the y-axis, series X1, X2, and X3

Page 28: Descriptive Statistics and Data Visualization

Histograms

Best for interval and ratio data
Represent percentages rather than counts
Each histogram has a total area of 100%
Since the bars cover a range of values, there are no gaps between bars
From a descriptive standpoint, allows one to look at the distribution of variables
Consider grouping the data if range > 15
Height of the vertical axis is important

Page 29: Descriptive Statistics and Data Visualization

Histogram of Family Terms

Figure: bar chart of counts per family (axis ticks at 25, 50, 75, 100)

Family                               Count
G-protein coupled receptor           16
cytokine                             14
enzyme                               100
growth factor                        12
ion channel                          16
kinase                               68
ligand-dependent nuclear receptor    10
peptidase                            24
phosphatase                          14
transcription regulator              107
translation regulator                1
transmembrane receptor               35
transporter                          57

Page 30: Descriptive Statistics and Data Visualization

Figure: histogram with standard error bars and a normal distribution fit

Page 31: Descriptive Statistics and Data Visualization

Histogram: SEM and Normal Distributions

The standard error of the mean (SEM) estimates how much the sample mean would be expected to vary across repeated samples from the same population

Fit Distribution (Normal) estimates the parameters of the normal distribution from the analysis sample
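Both quantities can be sketched with Python's standard library. The SEM is the sample standard deviation divided by the square root of the sample size, and a normal fit amounts to estimating a mean and a standard deviation from the sample. The data are hypothetical, and statistics.NormalDist.from_samples is used here as a stand-in for whatever fitting routine the original software applies.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data

n = len(sample)
sd = statistics.stdev(sample)            # sample SD (n - 1 denominator)
sem = sd / math.sqrt(n)                  # standard error of the mean

# Normal fit: estimate mu and sigma from the sample.
fitted = statistics.NormalDist.from_samples(sample)

print(f"mean = {statistics.mean(sample):.3f}, SD = {sd:.3f}, SEM = {sem:.3f}")
print(f"fitted normal: mu = {fitted.mean:.3f}, sigma = {fitted.stdev:.3f}")
```

The SEM shrinks as the sample grows, which is why error bars based on it get tighter with more data even when the spread of the data does not change.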

Page 32: Descriptive Statistics and Data Visualization

Pareto Charts

A Pareto chart is a special type of histogram in which the bars are arranged from largest to smallest

Allows one to determine which values are more important and which are less important

A Pareto chart combines a bar chart displaying the percentages of categories in the data with a line plot showing the cumulative percentages of the categories

Page 33: Descriptive Statistics and Data Visualization

Pareto Chart

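Figure: Pareto chart of failure causes (N=31), in chart order: contamination, oxide defect, miscellaneous, corrosion, metallization, doping, silicon defect. Left axis: Count (0 to 40). Right axis: Cum Percent (0 to 125). Cumulative percentages shown: 45.16, 70.97, 80.65, 87.10, 93.55, 96.77.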

SAS (1990)
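The numbers behind a Pareto chart are a sorted frequency table plus a running cumulative percentage. In the sketch below the per-cause counts are an assumption: they are back-calculated from the cumulative percentages and N=31 shown in the chart, not read from the original data set. The function name pareto_table is illustrative.

```python
# Assumed counts, back-calculated from the chart's cumulative percentages.
failure_counts = {
    "contamination": 14,
    "oxide defect": 8,
    "miscellaneous": 3,
    "corrosion": 2,
    "metallization": 2,
    "doping": 1,
    "silicon defect": 1,
}

def pareto_table(counts):
    """Sort causes from largest to smallest and add cumulative percent."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for cause, count in ordered:
        running += count
        rows.append((cause, count, 100 * count / total, 100 * running / total))
    return rows

rows = pareto_table(failure_counts)
print(f"{'Cause':<16}{'Count':>6}{'Pct':>8}{'Cum pct':>9}")
for cause, count, pct, cum_pct in rows:
    print(f"{cause:<16}{count:>6}{pct:>8.2f}{cum_pct:>9.2f}")
```

The bars of the chart plot the Count column and the line plots the Cum pct column.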

Page 34: Descriptive Statistics and Data Visualization

2-Way Comparative Pareto Chart

Figure: grid of Pareto charts of failure causes (contamination, metallization, corrosion, miscellaneous, silicon defect, oxide defect, doping); rows: before and after; columns: OCT 1, OCT 2, OCT 3. Left axis: Count (0 to 25). Right axis: Cum Percent (0 to 100).

SAS (1990)

Page 35: Descriptive Statistics and Data Visualization

Overlay Chart

Similar to a scatterplot, but you are looking at only one variable

Figure: overlay chart of Y (0 to 120) for Test1 and Test2 against April (x-axis, -5 to 35)

SAS (1989–2004)

Page 37: Descriptive Statistics and Data Visualization

Plots

Scatterplots look at the relationship between two or more variables
Great way to identify outliers
Typically the Y-axis is the dependent variable (DV) and the X-axis the independent variable (IV)
Using a control variable allows one to identify different groups
  For example, the relationship between blood pressure (bp) and weight, controlling for smoking vs. non-smoking
Why? Because we are controlling for some factor

Page 39: Descriptive Statistics and Data Visualization

Simple Scatterplot

Figure: scatterplot of Humid1:PM (y-axis, 20 to 100) against wrSpeed (x-axis, 0 to 15) with a fitted line

The line through the points is the least-squares line (scary math, but very important)

SAS (1989–2004)
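The math behind the least-squares line is less scary than it sounds: the slope is the sum of cross-products of deviations divided by the sum of squared deviations in x, and the intercept makes the line pass through the point of means. A minimal sketch follows; the data points are hypothetical (they are not read from the figure), and the variable names only echo the axis labels.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through the means
    return slope, intercept

# Hypothetical readings: wind speed (x) and afternoon humidity (y).
wind_speed = [0, 2.5, 5, 7.5, 10, 12.5, 15]
humidity = [90, 85, 70, 65, 50, 45, 30]

slope, intercept = least_squares(wind_speed, humidity)
print(f"humidity = {intercept:.2f} + {slope:.2f} * wind_speed")
```

A negative slope in this made-up example means humidity tends to fall as wind speed rises.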

Page 40: Descriptive Statistics and Data Visualization

Box-and-Whisker Plots

A graphical method based on percentiles
Useful for visualizing the distribution of a variable
Simultaneously displays the median, the IQR, and the smallest and largest values for a group
More compact than a histogram but less revealing
Good tool for identifying outliers and extreme values
Two common types: Outlier Box Plot and Quantile Box Plot
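The numbers a box plot draws can be computed directly. The sketch below uses the birth weight data from the earlier slide and flags possible outliers with the common 1.5 × IQR rule; that rule is an assumption here, since the slides do not say which rule the plotting software uses, and quartile values also depend on the interpolation method (Python's "inclusive" method is used).

```python
import statistics

birth_weights_oz = [
    32, 58, 64, 64, 67, 88, 88, 91, 93, 94,
    94, 89, 98, 98, 100, 101, 103, 103, 155, 161,
]

# 25th, 50th (median), and 75th percentiles.
q1, median, q3 = statistics.quantiles(birth_weights_oz, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the box are possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [x for x in birth_weights_oz if lower_fence <= x <= upper_fence]
outliers = sorted(x for x in birth_weights_oz if x < lower_fence or x > upper_fence)

# Whiskers end at the smallest and largest values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"possible outliers: {outliers}")
```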

Page 41: Descriptive Statistics and Data Visualization

Outlier Box Plot

Figure: outlier box plot on a 0 to 5 scale, labeling the 25th, 50th (median), and 75th percentiles, the IQR, the smallest and largest values that are not outliers, and possible outliers

Page 42: Descriptive Statistics and Data Visualization

Quantile Box Plot

Figure: quantile box plot on a 0 to 5 scale

Quantiles
100.0%   maximum    4.7605
99.5%               4.7605
97.5%               3.9211
90.0%               1.8560
75.0%    quartile   1.0298
50.0%    median     0.4325
25.0%    quartile   0.1671
10.0%               0.0451
2.5%                0.0064
0.5%                0.00041
0.0%     minimum    0.00041

Page 43: Descriptive Statistics and Data Visualization

Contact Information

Douglas J. Joubert, MLIS
Biomedical Informationist
National Institutes of Health Library
Bldg. 10, Room 1L09A
Bethesda, MD 20906-1150
Phone: 301.594.6282
Fax: 301.402.0254
E-mail: [email protected]: [email protected]

http://nihlibrary.nih.gov/

Page 44: Descriptive Statistics and Data Visualization

References

1. Johnson, L. L. (2004). Principles and Practices of Clinical Research (Lecture). NIH.

2. SAS (1990). Common causes of failure during the fabrication of integrated circuits. Data from "Selected SAS/QC Software Examples, Release 6.06," SAS Users Group International Conference, April 2, 1990, p. 383.

3. Munro, B. H. (2001). Statistical methods for health care research (4th ed.). Philadelphia: Lippincott Williams & Wilkins.

4. SAS Institute Inc. (1989-2004). SAS Help Files. Cary, North Carolina.