Description and measurement

Dr Kwang Lee

Email: k.h.lee@sheffield.ac.uk

03/05/2013


Outline

1. Concepts of scale of measurement (types of data, e.g. categorical, continuous).

2. Sampling methods, frequency and probability distributions.

3. Summary statistics and graphs, outliers, stem-and-leaf plots, box plots, scattergrams.


Scales of Measurement: categorical data

Nominal Scale - Labels represent various levels of a categorical variable.

Examples: gender, ethnicity, or marital status. Statistical test: chi-square.

Ordinal Scale - Labels represent an order that indicates either preference or ranking.

Example: quality of food (rated 0, 1, or 2), etc.

Statistical tests: Spearman's rank-order correlation (rho), Mann-Whitney U.

* Nominal (unordered; male, female) vs ordinal (ordered; food quality score 0,1,2,3)


Scales of Measurement: continuous data

Interval Scale - Numerical labels indicate order and distance between elements. There is no absolute zero and multiples of measures are not meaningful.

Examples: most personality measures and scale scores.

Statistical tests: t-test, ANOVA, regression, factor analysis, etc.

Ratio Scale - Numerical labels indicate order and distance between elements. There is an absolute zero and multiples of measures are meaningful.

Examples: length or distance in centimetres, inches, etc., all of which have an absolute zero.
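As an illustration of why multiples are meaningful only on a ratio scale, the sketch below (not part of the original slides) compares two temperatures in degrees Celsius, an interval scale, and in kelvin, a ratio scale. The readings 10 and 20 and the helper name `celsius_to_kelvin` are assumptions chosen purely for the example.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0  # hypothetical readings

# On the Celsius scale the number doubles...
ratio_celsius = warm_c / cold_c

# ...but measured from absolute zero the ratio is only slightly above 1,
# so "twice as hot" is not a meaningful statement in Celsius.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Differences (distances) agree on both scales.
diff_celsius = warm_c - cold_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

print(ratio_celsius, ratio_kelvin, diff_celsius, diff_kelvin)
```

The distance between the two readings is the same on both scales, but only the kelvin ratio reflects a physical multiple.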


Ordinal vs. interval scale

Most personality measures & scale scores


Classify the data according to the level of measurement.

1. Temperature, 2. Salary, 3. Time, 4. Postcode, 5. Grade

A) interval, nominal, interval, ratio, interval

B) nominal, ratio, interval, ordinal, ratio

C) ratio, ordinal, ordinal, interval, ratio

D) interval, ratio, ratio, nominal, ordinal


A study was conducted to investigate the effect of a coal-fired generating plant on the water quality of a river. As part of an environmental impact study, fish were captured, tagged, and released. The following information was recorded for each fish:

• sex (0 = female, 1 = male)
• length (cm)
• maturation (0 = young, 1 = adult)
• weight (g)

The scale of these variables is:

(a) nominal, ratio, nominal, ratio
(b) nominal, interval, ordinal, ratio
(c) nominal, ratio, ordinal, ratio
(d) ordinal, ratio, nominal, ratio
(e) ordinal, interval, ordinal, ratio


Descriptive statistics

• Methods of organising, summarising, and presenting data in a convenient and informative way. These methods include:
– numerical techniques
– graphical techniques

• The actual method used depends on what information you would like to extract. Are you interested in:
– measures of central location and/or
– measures of variability (dispersion)?


Measures of central location


MEAN

The mean is probably the most common indicator. It is defined as the arithmetic average of all values and measures the central tendency of a variable:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $n$ is the sample size.
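A minimal sketch of the arithmetic mean as "sum of the values divided by the sample size", checked against Python's standard `statistics.mean`. The ten values are borrowed from the median example in these slides; the variable names are illustrative.

```python
from statistics import mean

values = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

n = len(values)          # sample size
x_bar = sum(values) / n  # arithmetic average of all values

print(n, x_bar)
print(x_bar == mean(values))
```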


Median – a different kind of average

• “Middle value”
– Order the data
– When n is odd: take the middle value
– When n is even: average the two middle values

05 11 21 24 27 28 30 42 50 52

Here n = 10 is even, so median = average of 27 and 28 = 27.5
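The odd/even rule for the median can be sketched as follows. The function name `middle_value` is hypothetical; the result is compared with `statistics.median` from the standard library.

```python
from statistics import median


def middle_value(values):
    """Median: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)  # step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
result = middle_value(data)
print(result)  # average of 27 and 28
```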


Median is “robust”

Robust: resistant to skew and outliers.

This data set has a mean ($\bar{x}$) of 1600:

1362 1439 1460 1614 1666 1792 1867

This data set has an outlier (9867) and a mean of 2743:

1362 1439 1460 1614 1666 1792 9867

The median is 1614 in both instances.

The median was not influenced by the outlier.
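The two data sets above can be checked directly. This is a small sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_outlier = [1362, 1439, 1460, 1614, 1666, 1792, 9867]

mean_clean = mean(clean)
mean_outlier = mean(with_outlier)
median_clean = median(clean)
median_outlier = median(with_outlier)

# The mean moves a long way; the median does not move at all.
print(mean_clean, round(mean_outlier))
print(median_clean, median_outlier)
```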


Mode

• Mode: the value with the greatest frequency
• e.g., {4, 7, 7, 7, 8, 8, 9} has mode = 7
• Used only in very large data sets
• The mode is used less frequently than the mean or the median.
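A short sketch of finding the mode by counting frequencies, using the example set from the slide. `statistics.multimode` (Python 3.8+) is shown as well because a data set can have more than one most frequent value; the tied example is an assumption added for illustration.

```python
from collections import Counter
from statistics import multimode

data = [4, 7, 7, 7, 8, 8, 9]

counts = Counter(data)  # value -> frequency
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)

# With ties, every most frequent value is returned.
tied_modes = multimode([1, 1, 2, 2, 3])  # hypothetical data with two modes
print(tied_modes)
```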


Mean, Median, Mode

(A) Symmetrical data: mean = median

(B) Positive skew: mean > median [mean gets “pulled” by tail]

(C) Negative skew: mean < median

[Figure: positions of the mean, median, and mode in (A) symmetrical, (B) positively skewed, and (C) negatively skewed distributions]


Measures of variability


Range

The simplest way to describe the spread of a data set is to quote the minimum (lowest) and maximum (highest) values; the range is the difference between them.

e.g., minimum: 116, maximum: 170, range: 170 - 116 = 54

The range is affected by extreme values.
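The sketch below computes the range and shows how a single extreme value changes it. Only the minimum (116) and maximum (170) come from the slide; the other values, the added value 250, and the function name `value_range` are assumptions for illustration.

```python
def value_range(values):
    """Return (minimum, maximum, range) for a non-empty sequence of numbers."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


sample = [116, 128, 135, 149, 170]  # hypothetical data with min 116 and max 170
print(value_range(sample))

# One extreme value is enough to change the range sharply.
print(value_range(sample + [250]))
```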


Quartiles

Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.

[Figure: cumulative frequency (0 to 100) plotted against Ln_YarnS, with Q1, Q2, and Q3 marked]


Inter-Quartile Range

[Figure: distribution with Q1 and Q3 marked; 25% of the observations lie below Q1, 50% between Q1 and Q3, and 25% above Q3]

Inter-Quartile Range = IQR = Q3 - Q1
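A minimal sketch of computing the quartiles and the IQR with `statistics.quantiles` from the Python standard library. The slides do not say which interpolation rule to use, so the choice of `method="inclusive"` is an assumption; other software may give slightly different Q1 and Q3 for the same data. The values are reused from the median example.

```python
from statistics import median, quantiles

data = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the observations

print(q1, q2, q3, iqr)
```

Q2 is the median, so it agrees with the value found earlier for the same data.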


Variance and Standard deviation.

• The variance of a set of data is a measure of spread about the mean of a distribution.

• The variance uses all the data

• The standard deviation is the square root of the variance


The Variance

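Variance is one of the most frequently used measures of spread.

For a population,

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - N\mu^2}{N}$$

For a sample,

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$$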
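The sketch below computes the population variance (divide by N) and the sample variance (divide by n - 1) from the squared deviations about the mean, and checks the shortcut form (sum of squares minus n times the squared mean). The data are the seven values from the "robust" median slide, treated here as either a population or a sample purely for illustration.

```python
from math import isclose
from statistics import pvariance, variance

data = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
n = len(data)
x_bar = sum(data) / n

# Definition: average squared deviation about the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)
pop_var = sum_sq_dev / n         # population form, divisor N
samp_var = sum_sq_dev / (n - 1)  # sample form, divisor n - 1

# Shortcut: (sum of x_i^2 - n * mean^2) / (n - 1).
shortcut_samp_var = (sum(x * x for x in data) - n * x_bar ** 2) / (n - 1)

print(pop_var, samp_var, shortcut_samp_var)
```

The sample variance is always the larger of the two for the same data, because it divides the same sum by n - 1 rather than n.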


The Standard Deviation

Since variance is given in squared units, we often find uses for the standard deviation, which is the square root of the variance.

For a population,

$$\sigma = \sqrt{\sigma^2}$$

For a sample,

$$s = \sqrt{s^2}$$
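As a quick check that the standard deviation is the square root of the variance, this sketch compares the square roots of `statistics.pvariance` and `statistics.variance` with `statistics.pstdev` and `statistics.stdev`. The data are reused from the median slides for illustration only.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

data = [1362, 1439, 1460, 1614, 1666, 1792, 1867]

sigma = sqrt(pvariance(data))  # population standard deviation
s = sqrt(variance(data))       # sample standard deviation

# Both are in the original units of the data, unlike the variance.
print(sigma, s)
```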


Shape of the Distribution: Skewness

• Values need not be symmetrically distributed around the central point; distributions can be skewed

• Mean and standard deviation are insufficient to describe the distribution

[Figure: frequency curve with the mode, median, and mean ($\bar{x}$) marked]

This distribution is skewed to the right (positively skewed)


Consequences of a Skewed Distribution

• Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed

• Skewed variables can lead to undesirable effects

• Test statistics and confidence intervals can be biased

⇒ If the variable is not significantly skewed, continue

⇒ If the variable is skewed, transform the variable

For this reason you often find the logarithm of income, the square root of the mortality rate, etc.
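To illustrate the effect of a transformation, the sketch below measures skewness before and after taking logarithms. The slides do not give a skewness formula, so the moment coefficient used here (third central moment divided by the 1.5 power of the second) is an assumption, as are the income figures and the function name `skewness`.

```python
from math import log


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


incomes = [18, 20, 22, 25, 27, 30, 35, 40, 60, 150]  # hypothetical, in thousands
log_incomes = [log(x) for x in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

# The log transform pulls in the long right tail, so the skewness drops.
print(raw_skew, log_skew)
```

For this made-up sample the transformed variable is still positively skewed, only less so; a transformation reduces skew but does not guarantee symmetry.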


Kurtosis: a measure of the "peakedness"

• Two variables with equal mean and standard deviation, and symmetrically distributed, but a different kurtosis

[Figure: density curves f(x) and f(y) plotted on a common x,y axis, with the same mean m and standard deviations sx and sy]

Here, variable y has a larger kurtosis than variable x.
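The following sketch builds two small symmetric samples with the same mean and the same standard deviation but different kurtosis. The slides do not give a formula, so the moment definition used here (fourth central moment divided by the squared second) is an assumption, and both samples are made up for illustration.

```python
from statistics import mean, pstdev


def kurtosis(values):
    """Moment coefficient of kurtosis, m4 / m2**2 (one common definition)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - centre) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


x = [-1, -1, 1, 1]              # hypothetical: flat, no central peak
y = [-2, 0, 0, 0, 0, 0, 0, 2]   # hypothetical: sharp peak with heavier tails

print(mean(x), mean(y))      # same mean
print(pstdev(x), pstdev(y))  # same (population) standard deviation
print(kurtosis(x), kurtosis(y))
```

Mean and standard deviation cannot tell these two samples apart; the kurtosis can.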


Describe Samples: graphs

Box plot and stem-and-leaf diagram


Box Plot

• Visual display of:
– central tendency, variability, departure from symmetry, and outliers

• Box plots give a good graphical image of the concentration of the data.

• They also show how far the extreme values are from most of the data.

[Figure: box-and-whisker plot on a value axis from 45 to 85, labelled from top to bottom: max value, third quartile, mean, median, first quartile, min value]
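The numbers a box plot displays can be sketched as a five-number summary. The outlier rule used below (values more than 1.5 × IQR beyond the quartiles) is a common convention, not something stated in these slides, and the quartile method (`method="inclusive"`) and the function name `box_plot_summary` are likewise assumptions. The data are the outlier example from the median slides.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Five-number summary plus outliers flagged by the 1.5 * IQR convention."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
        "outliers": outliers,
    }


data = [1362, 1439, 1460, 1614, 1666, 1792, 9867]
summary = box_plot_summary(data)
print(summary)
```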


STEM AND LEAF DIAGRAMS

A stem and leaf diagram is a way of sorting data. It has two columns: STEM and LEAVES.

The data are split into the tens (the stem) and the units (the leaves).


STEM AND LEAF DIAGRAMS

We are going to put this data into a stem and leaf diagram.

12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28

We have numbers in the tens, twenties and thirties, so these become our stems.

STEM | LEAVES
1 |
2 |
3 |

Now we need to enter the leaves. The first number, twelve, has a 2 in the units column, so this becomes the leaf.

STEM | LEAVES
1 | 2
2 |
3 |


STEM AND LEAF DIAGRAMS

12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28

The next number is 32. This has a 2 in the units column, so it goes against stem 3.

STEM | LEAVES
1 | 2
2 |
3 | 2

The rest are entered in the same way.

STEM | LEAVES
1 | 2 6 2 0
2 | 2 4 5 8
3 | 2 4 0

Key: 1 | 2 = 12


STEM AND LEAF DIAGRAMS

If an ORDERED stem and leaf diagram is required, then you have to put the leaves in numerical order.

STEM | LEAVES
1 | 0 2 2 6
2 | 2 4 5 8
3 | 0 2 4

Key: 1 | 2 = 12

We can now use this to find the median. There are 11 pieces of data, so the median is the 6th number.

Median = 24

A stem and leaf diagram is a good choice when the data set is small!
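The construction of an ordered stem and leaf diagram can be sketched in a few lines. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers whose stem is the tens digit, as in the example data.

```python
from collections import defaultdict
from statistics import median


def stem_and_leaf(values):
    """Map each stem (tens) to its ordered list of leaves (units)."""
    groups = defaultdict(list)
    for value in sorted(values):       # sorting first gives ordered leaves
        stem, leaf = divmod(value, 10)  # tens digit, units digit
        groups[stem].append(leaf)
    return dict(groups)


data = [12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28]
diagram = stem_and_leaf(data)

for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

# With 11 values the median is the 6th value in order.
print(sorted(data)[5], median(data))
```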


• If most of the measurements in a large data set are of approximately the same magnitude except for a few measurements that are quite a bit larger, how would the mean and median of the data set compare and what shape would a histogram of the data set have?

• (a) The mean would be smaller than the median and the histogram would be skewed with a long left tail.

• (b) The mean would be larger than the median and the histogram would be skewed with a long right tail.

• (c) The mean would be larger than the median and the histogram would be skewed with a long left tail.

• (d) The mean would be smaller than the median and the histogram would be skewed with a long right tail.

• (e) The mean would be equal to the median and the histogram would be symmetrical.


When extreme values are present in a set of data, which of the following descriptive summary measures are most appropriate?

(a) Coefficient of variation and range.

(b) Mean and standard deviation.

(c) Median and inter-quartile range.

(d) Mode and variance.


The weights of the male and female students in a class are summarized in the following boxplots:

• Which of the following is NOT correct?

(a) About 50% of the male students have weights between 150 and 185 lbs.

(b) About 25% of female students have weights more than 130 lbs.

(c) The median weight of male students is about 162 lbs.

(d) The mean weight of female students is about 120 because of symmetry.

(e) The male students have less variability than the female students.


• The following is a stem-plot of the birth weights of male babies born to a group of mothers who smoked during pregnancies. The stems are in units of kg.

• The median birth weight is:

(a) 13.5 (b) 3.2 (c) 3.5 (d) 3.7 (e) Average of 13 and 14.

• The first quartile (25th percentile) of the weights is:

(a) 2.3 (b) 2.7 (c) .25 (d) 6.5 (e) 2.8


Thank you