
BIO5312 Biostatistics Lecture 1: Introduction

Yujin Chung

August 30th, 2016

Fall 2016

Yujin Chung Lec1: Descriptive Statistics Fall 2016 1/29


Basic information

Read the syllabus on Blackboard (learn.temple.edu)

Instructors:
  • Junchao Xia: [email protected]
  • Yujin Chung: [email protected]
  • William Flynn: [email protected]

Weekly homework: posted on Tuesday on Blackboard, due the following Tuesday before class. Upload your answers to Blackboard:

  • A document file (PDF, Word, etc.) including your answers
  • R code file
  • Extra files (e.g., plots) if necessary

Final grade: all homework will contribute equally to the final grade. No exam.


What are Statistics and Biostatistics?

From Wikipedia: Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

Statistics is the science whereby inferences are made about specific random phenomena on the basis of relatively limited sample material.

  • Data exploration and analysis
  • Quantification of uncertainty with probability

Biostatistics is the branch of applied statistics that applies statistical methods to medical and biological problems. For some biostatistical applications, standard methods do not apply and must be modified. In this circumstance, biostatisticians are involved in developing new methods.


Example 1: Effects of lead exposure

Research question: what are the effects of exposure to lead on neurological and psychological function in children?

Data:

Children who lived near a lead smelter in El Paso, Texas
  • Exposed group: 46 children with blood lead levels ≥ 40 µg/100 mL
  • Control group: 78 children with blood lead levels < 40 µg/100 mL

Measures of neurological function
  • the number of finger-wrist taps

Measures of psychological function
  • IQ

More data such as gender, age, etc.


Example 1: Effects of lead exposure

Research question: what are the effects of exposure to lead on neurological and psychological function in children?

Neurological function:

Are the numbers of finger-wrist taps in the control and exposed groups different?

What would be a good measure that represents/summarizes the numbers of finger-wrist taps?

Let’s say we use the mean. Are the group means different? How large a difference is a real difference?


Goals for the course

Basics of statistics
  • Probability distributions
  • Statistical inference: confidence intervals, hypothesis testing
  • Modern statistical practice

Statistical graphics

Using R for statistical analysis


Descriptive Statistics

Descriptive statistics: numeric or graphic display of data

to describe the data in some concise manner.

to indicate principal trends in data

Purpose: Initial data analysis and exploratory data analysis

Identifying missing values, outliers, errors etc

checking assumptions required for model fitting and hypothesis testing

finding trends and patterns in data that merit further study

possibly formulate new hypotheses


Types of Data

Continuous data forms a continuum

Example: blood pressure, IQ

Discrete data

Count data: obtained by counting
  • ex) the number of births/deaths

Categorical data
  • nominal data has two or more categories
    ex) blood types, genotypes
  • binary data: two categories or levels
    ex) exposed group vs control group, gender
  • ordinal data has ordered or ranked categories
    ex)

A mix of them?

Ambiguities in classifying a type of data!

Q) What is the type of the number of finger-wrist taps?


Measures of Location

Measure of location is a type of measure useful for data summarization that defines the center or middle of the sample $(x_1, \ldots, x_n)$ of size n.

• Continuous or count data

Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median:
  • if n is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
  • if n is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.

Geometric mean: the nth root of the product of the sample

• Discrete data

Mode: the most frequently occurring value among all the observations in a sample


Measures of Location: Arithmetic mean

Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The sum of all the observations divided by the number of observations.

Linear transformation: for constants a and c,
  • If $y_i = x_i + c$ for i = 1, ..., n, then $\bar{y} = \bar{x} + c$ (location)
  • If $y_i = a x_i$ for i = 1, ..., n, then $\bar{y} = a\bar{x}$ (scale)
  • If $y_i = a x_i + c$ for i = 1, ..., n, then $\bar{y} = a\bar{x} + c$ (location & scale)

Oversensitive to extreme values, in which case it may not be representative of the location of the majority of sample points.
  • The mean of sample (1, 2, 3, 4, 5) is $\bar{x} = 3$
  • The mean of sample (1, 2, 3, 4, 100) is $\bar{x} = 22$
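The two sample means and the location and scale rule can be checked with a minimal Python sketch. The course itself uses R; Python's standard `statistics` module is used here only for illustration, and the constants `a = 2` and `c = 10` are arbitrary choices made for this example.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
x_outlier = [1, 2, 3, 4, 100]

print(mean(x))          # 3
print(mean(x_outlier))  # 22

# Linear transformation y_i = a * x_i + c (a and c are arbitrary example values)
a, c = 2, 10
y = [a * xi + c for xi in x]
print(mean(y) == a * mean(x) + c)  # True
```

A single extreme value (100 instead of 5) moves the mean from 3 to 22, even though four of the five observations are unchanged.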


Measures of Location: Median

Median: Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:

  • if n is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
  • if n is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.

The rationale for these definitions is to ensure an equal number of sample points on both sides of the sample median.

Resistant to extreme values
  • The median of sample (1, 2, 3, 4, 5) is 3
  • The median of sample (1, 2, 3, 4, 100) is 3
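The odd and even cases of the definition can be written out directly. The sketch below is illustrative Python (the function name `sample_median` is made up for this example) and compares the result with `statistics.median` from the standard library.

```python
from statistics import median

def sample_median(sample):
    """Median following the odd/even definition on the slide."""
    xs = sorted(sample)          # order from smallest to largest
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th ordered value, 0-based index n // 2
        return xs[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th ordered values
    return (xs[mid - 1] + xs[mid]) / 2

print(sample_median([1, 2, 3, 4, 5]))    # 3
print(sample_median([1, 2, 3, 4, 100]))  # 3
print(sample_median([1, 2, 3, 4]))       # 2.5
print(sample_median([1, 2, 3, 4]) == median([1, 2, 3, 4]))  # True
```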


Measures of Location: Geometric mean

Geometric mean: the nth root of the product of the sample

$$\sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left\{\frac{1}{n}\sum_{i=1}^{n} \log x_i\right\} = \exp\left\{\overline{\log x}\right\}$$

The geometric mean is used when a logarithmic transformation is appropriate (for example, when the distribution has a long right tail).

The inequality of arithmetic and geometric means: the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list; further, the two means are equal if and only if every number in the list is the same.

$$\sqrt[n]{\prod_{i=1}^{n} x_i} \le \bar{x}$$
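The identity between the root form and the log form, and the inequality itself, can be checked numerically. This is a minimal Python sketch using the sample (1, 2, 3, 4, 100) from the earlier slides; `same` is a hypothetical constant sample added to show the equality case. Floating-point results are compared with `math.isclose` rather than exact equality.

```python
import math
from statistics import geometric_mean, mean

x = [1, 2, 3, 4, 100]

gm_root = math.prod(x) ** (1 / len(x))               # nth root of the product
gm_log = math.exp(mean([math.log(xi) for xi in x]))  # exp of the mean log

print(math.isclose(gm_root, gm_log))             # True
print(math.isclose(gm_log, geometric_mean(x)))   # True
print(gm_log <= mean(x))                         # True (AM-GM inequality)

same = [7, 7, 7]  # hypothetical sample: every value identical
print(math.isclose(geometric_mean(same), mean(same)))  # True (equality case)
```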


Comparisons of arithmetic mean, median and geometric mean

[Figure: three histograms (y-axis: Frequency) with the following summaries]

Symmetric: AM = 100.109, median = 100.153, GM = 100.104

Positively skewed: AM = 0.101, median = 0.07, GM = 0.054

Negatively skewed: AM = 108.408, median = 113.646, GM = 101.174


Measures of Location: mode

Mode: the most frequently occurring value among all the observations in a sample

Count data: the mode is 28,

Nominal data: the mode is blood type O,

blood type   O     A     B     AB
%            45%   33%   17%   5%

Data distributions may have one or more modes.

One mode = unimodal; Two modes = bimodal; Three modes = trimodal and so on.
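As a small illustration, the mode of the blood-type table is the category with the largest share, and a sample with two equally frequent values has two modes. The Python sketch below uses the percentages from the table; the list `counts` is hypothetical data invented for the bimodal case.

```python
from statistics import multimode

# Percentages from the blood-type table above
blood_type_pct = {"O": 45, "A": 33, "B": 17, "AB": 5}
modal_type = max(blood_type_pct, key=blood_type_pct.get)
print(modal_type)  # O

# Hypothetical count data with two equally frequent values (bimodal)
counts = [1, 1, 2, 3, 3]
print(multimode(counts))  # [1, 3]
```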


Measures of Location

The different possible measures of the “center” of the distribution are all allowable.

Which is the best measure of the “typical” value (for your situation)?

Be sure to make clear which “average” you use.

[Figure: histogram of IQ (y-axis: Frequency)] AM = 91.08, median = 91, GM = 89.91

[Figure: histogram of finger-wrist tapping (y-axis: Frequency)] AM = 61.44, median = 56, GM = 57.289


Measures of Spread

Continuous and count data

Range: the difference between the largest and smallest observations in a sample

Quartiles/Percentiles

Interquartile range

Variance and standard deviation

Coefficient of variation (CV)


Measures of Spread: Range

Range: the difference between the largest and smallest observations in a sample

Range is very sensitive to extreme observations.
  • The range of sample (1, 2, 3, 4, 5) is 5 − 1 = 4
  • The range of sample (1, 2, 3, 4, 100) is 100 − 1 = 99

The larger the sample size (n), the larger the range tends to be, which makes it difficult to compare ranges from data sets of different sizes.
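Both points can be seen in a short Python sketch. The helper `sample_range` and the list `data` are made up for this illustration; the second part shows that adding observations to a sample can only keep the range the same or increase it.

```python
def sample_range(sample):
    """Largest observation minus smallest observation."""
    return max(sample) - min(sample)

print(sample_range([1, 2, 3, 4, 5]))    # 4
print(sample_range([1, 2, 3, 4, 100]))  # 99

# Hypothetical data: range of the first k observations, for k = 1, ..., n
data = [5, 3, 8, 1, 9, 4]
prefix_ranges = [sample_range(data[:k]) for k in range(1, len(data) + 1)]
print(prefix_ranges)  # [0, 2, 5, 7, 8, 8]
```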


Measures of Spread: Quartiles/Percentiles

The pth percentile is defined, with the observations ordered from smallest to largest, by

  • The (k + 1)th largest sample point if np/100 is not an integer (where k is the largest integer less than np/100)
  • The average of the (np/100)th and (np/100 + 1)th largest observations if np/100 is an integer.

Quartiles: 1st quartile (Q1, 25th percentile), 2nd quartile (Q2, 50th percentile, median), 3rd quartile (Q3, 75th percentile)

Less sensitive to extreme values
  • If sample is (1, 2, 3, 4, 5), Q1 = 2, Q2 = 3, Q3 = 4
  • If sample is (1, 2, 3, 4, 100), Q1 = 2, Q2 = 3, Q3 = 4
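The two-case rule translates into a few lines of code. The sketch below is illustrative Python, not a library routine: the function name `percentile` is chosen for this example, p is assumed to be an integer strictly between 0 and 100, and software packages (including R's `quantile`) may use other interpolation rules by default.

```python
def percentile(sample, p):
    """p-th percentile (integer p, 0 < p < 100) using the rule stated above."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(sample)                # order from smallest to largest
    n = len(xs)
    k, remainder = divmod(n * p, 100)  # integer arithmetic avoids float rounding
    if remainder != 0:
        # np/100 is not an integer: take the (k + 1)th ordered value,
        # which sits at 0-based index k
        return xs[k]
    # np/100 is an integer k: average the kth and (k + 1)th ordered values
    return (xs[k - 1] + xs[k]) / 2

print([percentile([1, 2, 3, 4, 5], p) for p in (25, 50, 75)])    # [2, 3, 4]
print([percentile([1, 2, 3, 4, 100], p) for p in (25, 50, 75)])  # [2, 3, 4]
print(percentile([1, 2, 3, 4], 50))                              # 2.5
```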


Measures of Spread: Interquartile range

The interquartile range (IQR) of a sample is Q3 − Q1. Unlike the total range, the interquartile range has a breakdown point of 25%, and is thus often preferred to the total range.

Less sensitive to extreme values; a robust measure of spread

If sample is (1, 2, 3, 4, 5), IQR = 4− 2 = 2

If sample is (1, 2, 3, 4, 100), IQR = 4− 2 = 2
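A minimal Python sketch of the same comparison is given below. It uses `statistics.quantiles` with `method="inclusive"`, whose interpolation rule is not the same as the percentile definition in this lecture; it is assumed here only because it gives the same quartiles for these two samples. The helper name `iqr` is made up for the example.

```python
from statistics import quantiles

def iqr(sample):
    """Interquartile range Q3 - Q1 (quartiles from statistics.quantiles)."""
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    return q3 - q1

print(iqr([1, 2, 3, 4, 5]))    # 2.0
print(iqr([1, 2, 3, 4, 100]))  # 2.0
```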


Measures of Spread: Variance and standard deviation

Deviations are the differences between the individual sample points and the arithmetic mean; that is, $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$.

Variance ($s^2$) is, up to the divisor n − 1, the average of the squares of the deviations from the sample mean.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

A rationale for using n − 1 in the denominator rather than n is presented in the discussion of estimation in Chapter 6.

Standard deviation is $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
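The formula can be evaluated by hand and compared with the standard library. This Python sketch reuses the samples (1, 2, 3, 4, 5) and (1, 2, 3, 4, 100) from the earlier slides; `statistics.variance` and `statistics.stdev` use the n − 1 denominator.

```python
import math
from statistics import mean, stdev, variance

x = [1, 2, 3, 4, 5]
xbar = mean(x)

# Sum of squared deviations divided by n - 1
s2 = sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)
print(s2)                                     # 2.5
print(variance(x))                            # 2.5
print(math.isclose(stdev(x), math.sqrt(s2)))  # True

# One extreme value inflates the variance dramatically
print(variance([1, 2, 3, 4, 100]))            # 1902.5
```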


Measures of Spread: Variance and standard deviation II

Linear transformation: for constant a and c,

If $y_i = x_i + c$ for i = 1, ..., n, then $s_y^2 = s_x^2$, $s_y = s_x$ (location)

If $y_i = a x_i$ for i = 1, ..., n, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (scale)

If $y_i = a x_i + c$ for i = 1, ..., n, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (location & scale)

[Figure: distributions illustrating a location shift and a scale change; y-axis: Proportion]
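These rules can be verified numerically. The Python sketch below uses arbitrary example constants `a = -2` and `c = 10`; a negative `a` is chosen on purpose to show that the standard deviation scales by the absolute value of a, since a standard deviation cannot be negative.

```python
import math
from statistics import stdev, variance

x = [1, 2, 3, 4, 5]
a, c = -2, 10                 # arbitrary example constants
y = [a * xi + c for xi in x]  # y_i = a * x_i + c

print(math.isclose(variance(y), a ** 2 * variance(x)))  # True
print(math.isclose(stdev(y), abs(a) * stdev(x)))        # True

# A pure location shift leaves the spread unchanged
shifted = [xi + c for xi in x]
print(math.isclose(variance(shifted), variance(x)))     # True
```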


Measures of Spread: Coefficient of Variation (CV)

Coefficient of Variation (CV) is $100\% \times s/\bar{x}$.

Unit free

Useful in comparing variability of different samples with different arithmetic means

Useful for comparing the reproducibility of different variables
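The unit-free property can be illustrated with a short Python sketch. The heights below are hypothetical values invented for the example, and `cv` is a helper name chosen here; the same measurements expressed in centimetres and in metres give the same CV, while their standard deviations differ by a factor of 100.

```python
import math
from statistics import mean, stdev

def cv(sample):
    """Coefficient of variation in percent: 100 * s / x-bar."""
    return 100 * stdev(sample) / mean(sample)

heights_cm = [150, 160, 170, 180]           # hypothetical measurements
heights_m = [h / 100 for h in heights_cm]   # same data in metres

print(math.isclose(cv(heights_cm), cv(heights_m)))                # True
print(math.isclose(stdev(heights_cm), 100 * stdev(heights_m)))    # True
```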


Graphic Methods

Continuous data

Box plots

Histograms

Scatter plots

Discrete data

Bar graphs


Graphic Methods: Bar graphs

A bar plot is a chart that presents categorical or count data with rectangular bars with lengths proportional to the values that they represent.

[Figure: bar plot of blood types O, A, B, AB]


Graphic Methods: Box plots

A box plot is a standard way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.
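The five numbers behind a box plot can be computed as in the following Python sketch. The helper name `five_number_summary` is made up for this example, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions; plotting software may use a slightly different rule.

```python
from statistics import quantiles

def five_number_summary(sample):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(sample, n=4, method="inclusive")
    return min(sample), q1, q2, q3, max(sample)

print(five_number_summary([1, 2, 3, 4, 100]))  # (1, 2.0, 3.0, 4.0, 100)
```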


Graphic Methods: Box plots II

The finger-wrist tapping scores (MAXFWT) and full-scale IQ scores (IQF) seem slightly lower in the exposed group than in the control group.


Graphic Methods: Scatter plots

Scatter plots use horizontal and vertical axes to plot data points, one axis per variable. They have a very specific purpose: to show how one variable relates to another. The relationship between two variables is called their correlation (Chapters 5 and 11).

[Figure: scatter plot of IQ (y-axis) against FWT (x-axis)]


Summary

Numeric or graphic methods for displaying data help in

quickly summarizing a data set

And/or presenting results to others

Steps for exploratory data analysis

1. Identify data types
  • Continuous data: means, median, quartiles, variance, box plot, scatter plot
  • Categorical data: mode, contingency table, bar plot
  • Count data: maybe both, but it depends on the data property

2. Do not rely on one measure or one graphic display

3. Report several measures and graphs if they provide different information

4. After statistical inference, come back to the numerical and graphical analysis and confirm your inference agrees with your descriptive statistics!
