
Descriptive statistics and probability distributions I

BNG 202 – Biomechanics Lab


Overview

• The overall goal of this “short course” in statistics is to provide an introduction to descriptive and inferential statistical methods, with a focus on using MATLAB for implementation.

• The four modules are:
– Introduction and descriptive statistics
– Probability distributions
– Hypothesis testing
– Correlation and regression

• Each lecture will be supplemented with a MATLAB tutorial on the same topic
– We will work through part or all of the tutorial after reviewing the concepts; anything we don’t get to should be reviewed outside of class!


Statistics – a (very brief) introduction

• 1663 – Natural and Political Observations upon the Bills of Mortality by John Graunt is published
– Motivated by the desire to base policy on demographic data

• 1700s – Laplace introduces the normal distribution and regression via his study of astronomy

• 1800s – Quetelet applies statistical analysis to human biology

• The central purpose of statistics is to learn more about some population of interest (e.g., all humans in the world)
– However, we very rarely, if ever, have access to every individual in the population!
• “sample” – a subset of the entire population
• ??? – compilation of data about the entire population
– With a sample in hand, we seek to either summarize that data (using descriptive statistics) or use the data to make some prediction or statement about the population (using inferential statistics)

The “Central Dogma of Statistics”

• Descriptive statistics: used to summarize data (this is the focus for today)

• Inferential statistics: used to make inferences about the population

Dimensionality of data sets

• Univariate: measurements made on one variable per subject.

• Bivariate: measurements made on two variables per subject.

• Multivariate: measurements made on many variables per subject.

This will be the focus for modules 1-3

Types of descriptive statistics

• Central tendency measures: computed to give a “center” around which the measurements in the data are distributed (also called “measures of location”). (mean, median, mode, quartiles)

• Variation or variability measures: describe “data spread” or how far the measurements are away from the center. (variance, standard dev.)

• Relative standing measures: describe the relative position of specific measurements in the data. (percentiles)

Central tendency measures: mean

• The sample mean (a.k.a. average):

To calculate the average $\bar{x}$ of a set of observations, add their values and divide by the number of observations:

$$ \bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
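As a small illustration of the formula, here is a Python sketch (the course tutorials use MATLAB; Python and the function name `sample_mean` are choices made only for this example). The data are the student ages from the median example in this lecture.

```python
def sample_mean(values):
    """Add the observations and divide by the number of observations."""
    return sum(values) / len(values)


# Student ages taken from the median example in this lecture
ages = [17, 19, 21, 22, 23, 23, 23, 28]
print(sample_mean(ages))
```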

Central tendency measures: median

• Median – the exact middle value

• Calculation
– If there are an odd number of observations, find the middle value
– If there are an even number of observations, find the mean of the middle two values

• Example: – Age of students: 17 19 21 22 23 23 23 28

Median = ave(22,23) = 22.5
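The odd/even rule above can be sketched in Python as follows (an illustrative helper, not the MATLAB code used in the tutorial; the name `median` is chosen for this example).

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [17, 19, 21, 22, 23, 23, 23, 28]
print(median(ages))  # mean of the two middle values, 22 and 23
```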

Which central tendency measure is better?

• In other words, which measure better approximates the “center” of a data set?

• Mean is best for symmetric distributions w/o outliers
• Median is useful for skewed distributions or data with (one-directional) outliers

[Figure: two example data sets. First: mean = 3.125, median = 3. Second: mean = 4.857, median = 4.]
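To see why the median is preferred when there is a one-directional outlier, the Python sketch below compares both measures on a small hypothetical data set (the numbers are invented for illustration and are not the data behind the figures on this slide).

```python
import statistics

# Hypothetical data: the second list replaces the largest value with an outlier
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]

for label, data in (("symmetric", symmetric), ("with outlier", with_outlier)):
    print(label, statistics.mean(data), statistics.median(data))
```

The outlier pulls the mean far from the bulk of the data, while the median does not move.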

Scale: variance

• The sample variance is the average of squared deviations of values from the mean

• Square the deviations to get rid of the negatives

– The result is that the contribution to the variance increases as you go farther from the mean in either direction

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

Scale: standard deviation

• Procedure to obtain the sample standard deviation:
– Score/measure observations (in the units that are meaningful, let’s say m/s)
– Find the mean of the observations (m/s)
– Find each score’s deviation from the mean (m/s)
– Square all those deviations, giving units of (m/s)²
– Divide by n – 1, still in (m/s)² (note that this is the variance)
– Take the square root (m/s) – now we have the starting units!

Let’s do a simple example problem!

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
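The procedure can be followed step by step in a short Python sketch. The speeds below are hypothetical values in m/s, and the function names are chosen for this example only.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                           # m/s
    deviations = [x - mean for x in values]          # m/s
    squared = [d ** 2 for d in deviations]           # (m/s)^2
    return sum(squared) / (n - 1)                    # (m/s)^2


def sample_std(values):
    """Square root of the sample variance, back in the original units."""
    return math.sqrt(sample_variance(values))


speeds = [2.0, 4.0, 4.0, 6.0]  # hypothetical measurements in m/s
print(sample_variance(speeds), sample_std(speeds))
```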

Central tendency measures: mode

• The mode is the observation that occurs most frequently in a data set

• Unlike the mean or median, the mode is not necessarily unique – the same maximum frequency may occur at different values.

• Based on the previous slide, is the mode a parametric statistic? (hint: remember that it is a measure of central tendency)
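Because the mode need not be unique, a helper that returns every most-frequent value is convenient. The Python sketch below uses the standard library's `statistics.multimode` (available from Python 3.8); the second data set is hypothetical.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 28]   # ages from the median example
two_modes = [1, 1, 2, 2, 3]               # hypothetical data with a tie

print(statistics.multimode(ages))       # a single mode
print(statistics.multimode(two_modes))  # two values share the top frequency
```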

Scale: quartiles and IQR

• Q2 is the same as the median
• The first quartile (Q1) and third quartile (Q3) are the medians of the data sets that would be created if all of the values below and above Q2, respectively, were chosen.

• The interquartile range (IQR) is Q3 – Q1

Quartiles example problem

• Find the three quartiles and IQR of the following two datasets:

Dataset 1: 2 3 6 10 12 14 15 18
Q1 = 4.5, Median = 11, Q3 = 14.5, IQR = 10

Dataset 2: 2 5 9 11 13 15 19 21 24
Q1 = 7, Median = 13, Q3 = 20, IQR = 13

It is easiest to first “insert” the median – the lower and upper halves from which to find Q1 and Q3 should then be obvious

Note from this example that the “25%” rule from the previous slide isn’t precisely correct
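The “insert the median, then take the median of each half” approach can be sketched in Python as below. This follows the definition used on these slides; statistical software often uses other interpolation rules for quartiles, so built-in functions may return slightly different values. The function names are chosen for this example.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves definition."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median itself belongs to neither half
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


for data in ([2, 3, 6, 10, 12, 14, 15, 18], [2, 5, 9, 11, 13, 15, 19, 21, 24]):
    q1, q2, q3 = quartiles(data)
    print(q1, q2, q3, q3 - q1)  # Q1, median, Q3, IQR
```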

Percentiles (aka quantiles)

• Generally, the nth percentile is a value such that n% of the observations fall at or below it:

Q1 = 25th percentile

Median = 50th percentile

Q3 = 75th percentile

Univariate data: histograms and bar plots

• What’s the difference between a histogram and a bar plot?

• Bar plot
– Used for categorical variables to show frequency or proportion in each category.
– Translates the data from frequency tables into a pictorial representation...

• Histogram
– Used to visualize distribution (shape, center, range, variation) of continuous variables
– “bin size” is important

Effect of bin size on histogram

• Simulated 1000 N(0,1)
– 1000 random numbers from the standard normal distribution with mean 0 and st. dev. 1
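A rough way to explore the effect of bin size without plotting is to count how many observations land in each bin for several bin counts. The Python sketch below uses only the standard library; the seed, the bin counts, and the function name `frequency_histogram` are arbitrary choices for this example.

```python
import random


def frequency_histogram(data, n_bins):
    """Count observations in n_bins equal-width bins spanning min to max."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value is placed in the last bin
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


random.seed(0)  # arbitrary seed so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(1000)]

for n_bins in (5, 20, 80):
    counts = frequency_histogram(sample, n_bins)
    print(n_bins, "bins, tallest bar:", max(counts))
```

With few bins the shape is coarse; with many bins each bar holds few observations and the outline becomes noisy.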

More on histograms

• What’s the difference between a frequency histogram and a density histogram?


More on histograms

• So, for our roughly gaussian distribution from earlier, the density histogram looks like this:

[Figure: histogram with x-axis “observation” (1 to 5) and y-axis “relative frequency” (0 to 0.4)]
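One common convention, assumed here, is that a frequency histogram plots raw counts per bin, while a density histogram divides each count by (number of observations × bin width) so that the total area of the bars is 1. The Python sketch below checks that property on simulated N(0,1) data; the function name and seed are choices made for this example.

```python
import random


def histogram_counts_and_density(data, n_bins):
    """Return (counts, densities, bin_width) for equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    n = len(data)
    # Density height = count / (n * width), so that bar areas sum to 1
    densities = [c / (n * width) for c in counts]
    return counts, densities, width


random.seed(1)  # arbitrary seed
sample = [random.gauss(0, 1) for _ in range(1000)]
counts, densities, width = histogram_counts_and_density(sample, 20)

print("sum of counts:", sum(counts))
print("total area of density bars:", sum(d * width for d in densities))
```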

Stem and leaf plots

Box and whisker plots

An outlier is a score more than 1.5 × IQR above the upper quartile or below the lower quartile

Example problem

• Two different classes take a quiz and get the following scores.

Class 1: 2, 4, 6, 8, 10, 12, 14
Class 2: 2, 2, 3, 8, 8, 10, 23

• What are the mean and median of each set?
– The same! (mean = median = 8 for both classes)
• Will making a box and whisker plot of each set of data give us a better picture of their distributions? (let’s do the second one together)

Box plot procedure

• Steps to make our box plot:
– Find the median, Q1, Q3, and IQR
– Draw 3 horizontal lines, at Q1, median, and Q3
– Draw the corresponding vertical lines to make the boxes
– Compute the lower inner fence (Q1 – 1.5*IQR) and the upper inner fence (Q3 + 1.5*IQR)
– Draw a whisker downward from Q1 to lower inner fence or minimum, whichever comes first
– Draw a whisker upward from Q3 to upper inner fence or maximum, whichever comes first
– Compute the lower outer fence (Q1 – 3*IQR) and the upper outer fence (Q3 + 3*IQR)
– Mild outliers fall between the inner and outer fences, mark with “O”
– Extreme outliers fall outside outer fences, mark with “*”
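The numerical part of the procedure (everything except the drawing) can be sketched in Python and applied to the two quiz classes. Quartiles use the median-of-halves definition from the quartiles slide, and whiskers follow the rule stated above (fence or extreme value, whichever comes first); other software may use different conventions. Function and key names are chosen for this example.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def box_plot_summary(values):
    """Quartiles, fences, whisker ends, and outliers for a box plot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Whiskers stop at the inner fence or the extreme value, whichever comes first
    whiskers = (max(inner[0], ordered[0]), min(inner[1], ordered[-1]))
    mild = [x for x in ordered
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    extreme = [x for x in ordered if x < outer[0] or x > outer[1]]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "inner": inner, "outer": outer, "whiskers": whiskers,
            "mild": mild, "extreme": extreme}


class_1 = [2, 4, 6, 8, 10, 12, 14]
class_2 = [2, 2, 3, 8, 8, 10, 23]
for name, scores in (("Class 1", class_1), ("Class 2", class_2)):
    print(name, box_plot_summary(scores))
```

Although both classes share the same mean and median, only Class 2 has a score (23) beyond the upper inner fence, which the box plot makes visible.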


Now let’s switch over and do some work in MATLAB!

Probability Distributions

http://www-users.york.ac.uk/~pml1/bayes/cartoons/cartoon08.jpg

The “Central Dogma of Statistics”

(this is the focus for today)

i.e., the probability distribution

Probability distributions

• We’ve discussed that data can be normally distributed (a.k.a. “Gaussian” or “bell-shaped”) – in fact, many real-life variables are, including:

http://sixminutes.dlugan.com/wp-content/uploads/2010/02/height-bell-curve.jpg
http://www.davinciinstitute.com/wp-content/uploads/2011/11/Voter-IQ-Bell-Curve.jpg

But what does this mean mathematically?

Probability distributions

• A probability distribution – which can either be discrete or continuous – is a table (discrete) or mathematical function (continuous) of one or more variables that describes the likelihood that any given value (discrete) or set of values (continuous) will occur

• Because the entire population is characterized, the main usefulness is in calculating the probability that certain values (discrete) or a range of values (continuous) will occur

First, let’s examine a couple of discrete cases (we’ll then move to the continuous case)

What is the probability distribution of rolling a die?

• If all outcomes are equally likely (i.e., if the die is “fair”), then:

P(1) = ?

• Note the total probability is 1!
• We use X (upper case) to denote an individual from the population
– For example, P(X = 2) = 1/6

Probability distribution:

xi    P(xi)
1     1/6
2     1/6
3     1/6
4     1/6
5     1/6
6     1/6

If written as a function, we call it the probability mass function (pmf)
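The table can be written as a small pmf in Python. Exact fractions are used so that the check that the total probability equals 1 is not affected by rounding; the variable names are chosen for this example.

```python
from fractions import Fraction

# pmf of a fair six-sided die: each face has probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

total_probability = sum(pmf.values())
print(pmf[2], total_probability)
```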

What is the probability distribution of a random number generator?

• Say you have a program (e.g., “rand” in MATLAB) that picks a real number between 0 and 1 (the “uniform distribution”):

• Since we still need our total area to equal 1, what must the value of f(x) be (i.e., at what y-axis value is the upper line in the graph)?

• f(x) is the probability density function (pdf) – it is the continuous analog of the pmf.

• Here, we run into a problem: if x can be any real number, what must be the probability of observing a given value – i.e., what is P(X = x) – for any continuous distribution?
• Unlike in the discrete case, P(X = x) = 0 in the continuous case

$$ f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} $$

[Figure: plot of f(x) versus x for x between 0 and 1]

What is the probability distribution of a random number generator (cont.)?

• In the continuous case, we instead care about the probability of a randomly selected variable X from the distribution being within a certain range of values

• Look at f(x) above; if we want P(0.25 < X < 0.75), how can we evaluate this mathematically?

• Integrating the pdf gives the cumulative distribution function (cdf), or F(x), which is evaluated over the desired limits!

Let’s do an example – what is P(0.25 < X < 0.75)?

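[Figure: plot of f(x) versus x for x between 0 and 1]

$$ f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} $$

$$ F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases} $$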
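For the uniform distribution on [0, 1], the example can be checked in Python by evaluating the cdf at the two limits and subtracting. The function names are chosen for this sketch, and the cdf is written out for all x (0 below the interval, 1 above it).

```python
def uniform_cdf(x):
    """F(x) for the uniform distribution on [0, 1]."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return float(x)


def probability_between(a, b):
    """P(a < X < b) = F(b) - F(a)."""
    return uniform_cdf(b) - uniform_cdf(a)


print(probability_between(0.25, 0.75))
```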


The mean and variance of continuous random variables

• There are many different continuous probability distributions (here are a few examples we will see):

[Figure: example pdfs – Normal, Uniform, Exponential, Parabolic]

• Every distribution has a unique:
– Expected value: E(X) = μ, a weighted average of all the possible values that this random variable can take on
– Variance: V(X) = σ², a measure of the “spread”, or the extent to which values in the distribution are dispersed

If we know the pdf of a given distribution, we can calculate its mean and variance!

Expected value of random variables

• Let’s re-visit our die problem; “on average”, what is the expected value of a roll, given the die goes from 1-6 (hint: it’s not one of the integers)?

• Mathematically, how would you fill in the parentheses below to arrive at the same answer?

$$ E(X) = \sum_{i=1}^{6} (\quad)(\quad) $$

• How can we express the same concept in the continuous case?

$$ E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx $$

Note that these limits will vary depending on the distribution, according to where f(x) is non-zero

Let’s try this out for our uniform distribution example and the generalized case!
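One way to check an answer numerically is sketched below in Python. It assumes the discrete expected value is the sum of each value times its probability, and approximates the continuous integral for the uniform distribution on [0, 1] with a midpoint sum; the function names and the number of steps are choices made for this example.

```python
from fractions import Fraction


def expected_value_discrete(pmf):
    """Weighted average: sum of value * probability over all values."""
    return sum(x * p for x, p in pmf.items())


def expected_value_uniform(n_steps=1000):
    """Midpoint-rule approximation of the integral of x * f(x) over [0, 1]."""
    dx = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        x = (i + 0.5) * dx
        f_x = 1.0  # uniform pdf on [0, 1]
        total += x * f_x * dx
    return total


die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value_discrete(die_pmf))  # not one of the integers 1-6
print(expected_value_uniform())
```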

References

• “Lecture 2 – Descriptive Statistics and Exploratory Data Analysis” University of Washington School of Medicine.

• http://www.webquest.hawaii.edu/
• http://www.sonoma.edu/users/w/wilsonst/courses/math_300/Final/p14/default.html

