ST102 Elementary Statistical Theory
Descriptive statistics
Dr James Abdey
Department of Statistics, London School of Economics and Political Science
ST102 Elementary Statistical Theory Dr James Abdey MT 2014 Part I: 2. Descriptive statistics 31
Part I: 2. Descriptive statistics
Part I:
1. Introduction.
2. Descriptive statistics.
3. Introduction to probability theory.
4. Random variables.
5. Some common distributions of random variables.
6. Multivariate random variables.
7. Sampling distributions of statistics.
2. Descriptive statistics
2.1: Introduction.
2.2: The sample distribution.
2.3: Measures of central tendency.
2.4: Measures of dispersion.
2.5: Associations between two variables.
2.1: Introduction
Starting point: A collection of numerical data (a sample) has been collected in order to answer some questions.
Statistical analysis may have two broad aims:
1. Descriptive statistics: Summarise the data that were collected, in order to make the data more understandable.
2. Statistical inference: Use the observed data to draw conclusions about some broader population.
Sometimes 1. is the only aim.
Even when 2. is the main aim, 1. is still an essential first step.
Need for descriptive statistics
Data do not just speak for themselves: there are usually simply too many numbers to make sense of just by staring at them.
Descriptive statistics attempt to summarise some key features of the data to make them understandable and easy to communicate.
These summaries may be numerical (tables or individual summary statistics) or graphical.
Example
Consider data for 155 countries on three things, from around 2002:
Region of the country.
Coded as 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = North America and 6 = Oceania.
Level of democracy in the country.
An 11-point scale from 0 (lowest level of democracy) to 10 (highest).
Gross domestic product (GDP) per capita (in $000s).
A dataset
The statistical data in a sample are typically stored in a data matrix:
[Figure: data matrix for the country example.]
Units and variables
Rows of the data matrix correspond to different units (subjects).
Here each unit is a country.
The number of units in a dataset is the sample size, typically denoted by the letter n.
Here n = 155 countries.
Columns of the data matrix correspond to variables, i.e. different characteristics of the units.
Here region, level of democracy and GDP per capita are the variables.
Continuous and discrete variables
Different variables may have different properties. These determine what kinds of statistical methods are suitable for the variables.
A continuous variable can, in principle, take any real value within some (continuous) interval.
For example GDP per capita, which can take any non-negative value.
A variable is discrete if it is not continuous, i.e. if it can only take certain (usually integer) values, but not any others.
For example region, with possible values 1, 2, 3, 4, 5 and 6, and the level of democracy, with possible values 0, 1, 2, ..., 10.
Discrete variables: number of possible values
Many discrete variables have only a finite number of possible values. In our example, region has 6, and level of democracy has 11 possible values.
The simplest possibility is a binary (dichotomous) variable, with just two values. For example, a person's sex recorded as 1 = female and 2 = male.
A discrete variable can also have an unlimited number of possible values.
For example, the number of visitors to a website in a day could be 0, 1, 2, 3, 4, ....
Discrete variables: ordering of the values
In the example, the levels of democracy have a meaningful ordering, from less to more democratic countries.
The numbers assigned to the different levels must also be in this order: a larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, North America and Oceania) do not have such a natural ordering.
The numbers used for the variable Region are just labels for different regions. A different numbering (for example 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 = North America and 4 = Oceania) would be just as acceptable as the one we used.
Some statistical methods are appropriate for variables with both ordered and unordered values, some only in the ordered case.
Unordered categories are nominal data; ordered categories are ordinal data.
Using computers for statistical analysis
For understanding and practice, we ask you to calculate some descriptive statistics by hand.
However, most real statistical analysis is done with computers, using statistical software packages.
To give you an idea of how they work, in Exercise 1 we ask you to do some descriptive statistics with a package called Minitab.
See a note on the ST102 Moodle site for instructions on how to use Minitab for the exercise.
There are many other statistical packages which do more or less the same thing and which you may encounter in later courses: Stata, SPSS, R, SAS and others.
2.2: The sample distribution
The sample distribution of a variable consists of:
a list of the values of the variable that are observed in the sample
the number of times each value occurs (the counts or frequencies of the observed values).
When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.
Example: observations of region in the sample
3 1 1 4 2 6 3 2 2 2 3 3 1 2 4
1 4 3 1 2 1 1 2 1 5 1 4 2 4 1
1 4 1 3 4 2 3 3 1 4 2 4 1 4 1
1 3 1 6 3 3 1 1 2 3 1 3 4 1 1
4 4 4 3 2 2 2 2 3 2 3 4 2 2 2
1 2 2 2 3 1 1 1 3 3 1 1 2 1 1
1 4 3 2 1 1 2 1 2 3 4 1 1 3 6
2 2 4 4 4 2 6 3 3 2 3 3 1 1 2
2 1 3 1 2 3 3 3 2 1 1 3 3 2 2
2 1 2 1 4 1 2 2 2 1 3 3 4 5 2
4 2 2 1 1
Frequency table of region
                      Frequency   Relative frequency
Region                (count)     (%)
(1) Africa               48         31.0  [= 100 × (48/155)]
(2) Asia                 45         29.0
(3) Europe               34         21.9
(4) Latin America        23         14.8
(5) North America         2          1.3
(6) Oceania               3          1.9
Total                   155        100
Here % is the percentage of countries in a region, out of the sample of 155 countries. This is a measure of proportion (relative frequency).
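As a minimal sketch, the relative frequencies in the table can be recomputed from the counts. Python is used here purely for illustration (the course itself uses Minitab), and the variable names are our own.

```python
# Counts of countries per region, copied from the frequency table above.
region_counts = {
    "Africa": 48,
    "Asia": 45,
    "Europe": 34,
    "Latin America": 23,
    "North America": 2,
    "Oceania": 3,
}

n = sum(region_counts.values())  # sample size

# Relative frequency (%) = 100 * count / n, rounded to one decimal place.
relative_freq = {
    region: round(100 * count / n, 1)
    for region, count in region_counts.items()
}

for region, pct in relative_freq.items():
    print(f"{region:<15}{region_counts[region]:>5}{pct:>8.1f}")
print(f"{'Total':<15}{n:>5}")
```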
Frequency table of the level of democracy
Democracy index   Frequency      %     Cumulative %
 0                   35        22.6       22.6
 1                   12         7.7       30.3
 2                    4         2.6       32.9
 3                    6         3.9       36.8
 4                    5         3.2       40.0
 5                    5         3.2       43.2
 6                   12         7.7       50.9
 7                   13         8.4       59.3
 8                   16        10.3       69.6
 9                   15         9.7       79.3
10                   32        20.6      100
Total               155       100
Cumulative % for a value of the variable is the sum of the percentages for that value and all lower-numbered values.
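The cumulative column can be sketched in a few lines of Python (an illustrative choice of language, with our own variable names). The table's figures correspond to a running sum of the rounded percentages; cumulating the counts first and then converting to a percentage can differ in the last digit, so both versions are shown.

```python
from itertools import accumulate

# Frequencies of democracy index values 0, 1, ..., 10 (from the table above).
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
n = sum(freqs)

# Percentages rounded to one decimal place, as printed in the table.
pct = [round(100 * f / n, 1) for f in freqs]

# Running sum of the rounded percentages (the definition given above).
cum_pct = [round(c, 1) for c in accumulate(pct)]

# Alternative: cumulate the counts first, then convert to a percentage.
cum_exact = [round(100 * c / n, 1) for c in accumulate(freqs)]

for value, (f, p, c) in enumerate(zip(freqs, pct, cum_pct)):
    print(f"{value:>2}{f:>6}{p:>7.1f}{c:>8.1f}")
```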
Bar charts
A bar chart is the graphical equivalent of the table of frequencies.
[Figure: bar chart of region. Horizontal axis: Region (Africa, Asia, Europe, Latin America, Northern America, Oceania); vertical axis: Count (0 to 50).]
Bar chart of the level of democracy
[Figure: bar chart of the level of democracy. Horizontal axis: Democracy index (0 to 10); vertical axis: Percentage (0.0% to 25.0%).]
Sample distributions of variables with many values
If a variable has many distinct values, listing the frequencies of all of them does not work well.
Solution: Group the values into non-overlapping intervals, and produce a table or graph of the frequencies within the intervals.
The most common graph used for this is a histogram.
A histogram is like a bar chart, but without gaps between the bars.
A histogram often uses more bars (intervals of values) than is sensible in a table.
Histograms are usually drawn using statistical software; you can let the software choose the intervals and the number of bars.
Table of frequencies for GDP per capita
GDP per capita ($000s)   Frequency      %
Less than 2.0                49       31.6
2.0 to 4.9                   32       20.6
5.0 to 9.9                   29       18.7
10.0 to 19.9                 21       13.5
20.0 to 29.9                 19       12.3
30.0 or more                  5        3.2
Total                       155      100
2.3: Measures of central tendency
Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a variable.
Next we consider descriptive statistics which summarise one feature of the sample distribution in a single number: summary statistics.
We begin with measures of central tendency. These answer the question: where is the 'centre' or 'average' of the distribution? We consider:
the mean (arithmetic mean or average)
the median
the mode.
Preliminaries: notation for variables
In formulae, a generic variable is denoted by a single letter.
In these course notes, usually X.
Any other letter ($Y$, $W$, etc.) can also be used, as long as it is used consistently.
A letter with a subscript denotes a single observation of a variable.
For example, we use $X_i$ to denote the value of $X$ for unit $i$, where $i$ can take values $1, 2, 3, \ldots, n$, and $n$ is the sample size.
Therefore, the $n$ observations of $X$ in the dataset (the sample) are $X_1, X_2, X_3, \ldots, X_n$. These can also be written as $X_i$, $i = 1, \ldots, n$.
Preliminaries: summation notation
Let $X_1, X_2, \ldots, X_n$ (i.e. $X_i$, $i = 1, \ldots, n$) be a set of $n$ numbers. The sum of the numbers is written as:
\[ \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n. \]
This may also be written as $\sum_i X_i$ or just $\sum X_i$.
Other versions of the same idea:
Infinite sums:
\[ \sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots. \]
Sums of sets of observations other than 1 to $n$, for example:
\[ \sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}. \]
Properties of the summation operator
Here $X_i$ and $Y_i$ ($i = 1, \ldots, n$) are sets of $n$ numbers.
Here $a$ denotes a constant, i.e. a number with the same value for all $i$.
All of the following results follow simply from the properties of addition (if you are still not convinced, try them with $n = 3$).
(1) $\sum_{i=1}^{n} a = n \times a$.
Proof: $\sum_{i=1}^{n} a = \underbrace{(a + \cdots + a)}_{n \text{ times}} = n \times a$.
(2) $\sum_i aX_i = a \sum_i X_i$.
Proof: $\sum_i aX_i = (aX_1 + \cdots + aX_n) = a(X_1 + \cdots + X_n) = a \sum_i X_i$.
Properties of the summation operator
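(3) $\sum_i (X_i + Y_i) = \sum_i X_i + \sum_i Y_i$.
Proof: Rearranging the elements of the summation, we get:
\[
\begin{aligned}
\sum_i (X_i + Y_i) &= [(X_1 + Y_1) + (X_2 + Y_2) + \cdots + (X_n + Y_n)] \\
&= [(X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n)] \\
&= (X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n) \\
&= \sum_i X_i + \sum_i Y_i.
\end{aligned}
\]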
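Following the suggestion to try the results with n = 3, here is a minimal numerical check of properties (1) to (3). It is only a sketch in Python; the values of X, Y and a are arbitrary illustrative choices, not taken from the course data.

```python
# Arbitrary illustrative numbers with n = 3.
X = [2, 5, 7]
Y = [1, 4, 10]
a = 3
n = len(X)

# (1) Summing the constant a over i = 1, ..., n gives n * a.
prop1 = sum(a for _ in range(n)) == n * a

# (2) A constant factor can be taken outside the sum.
prop2 = sum(a * x for x in X) == a * sum(X)

# (3) The sum of (X_i + Y_i) splits into the sum of X_i plus the sum of Y_i.
prop3 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

print(prop1, prop2, prop3)
```

A numerical check of this kind illustrates the properties but does not replace the proofs above.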
Extension: double (triple etc.) summation
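Sometimes sets of numbers may be indexed with two (or even more) subscripts, for example as $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$.
Summation over both indices is written as:
\[
\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} &= \sum_{i=1}^{n} (X_{i1} + \cdots + X_{im}) \\
&= (X_{11} + \cdots + X_{1m}) + (X_{21} + \cdots + X_{2m}) + \cdots + (X_{n1} + \cdots + X_{nm}).
\end{aligned}
\]
The order of summation can be changed, that is:
\[ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = \sum_{j=1}^{m} \sum_{i=1}^{n} X_{ij}. \]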
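A small sketch of the interchange of summation order, using a hypothetical 2 x 3 array of numbers (the values are made up for illustration):

```python
# A hypothetical n x m array of numbers X[i][j] (n = 2 rows, m = 3 columns).
X = [
    [1, 2, 3],
    [4, 5, 6],
]
n, m = len(X), len(X[0])

# Inner sum over j for each i, then outer sum over i.
rows_first = sum(sum(X[i][j] for j in range(m)) for i in range(n))

# Swapped order: inner sum over i for each j, then outer sum over j.
cols_first = sum(sum(X[i][j] for i in range(n)) for j in range(m))

print(rows_first, cols_first)
```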
Product notation
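The analogous notation for the product of a set of numbers is:
\[ \prod_{i=1}^{n} X_i = X_1 \times X_2 \times \cdots \times X_n. \]
It follows from the properties of multiplication that, for example:
1. $\prod_{i=1}^{n} aX_i = a^n \left( \prod_{i=1}^{n} X_i \right)$.
2. $\prod_{i=1}^{n} a = a^n$.
3. $\prod_{i=1}^{n} X_i Y_i = \left( \prod_{i=1}^{n} X_i \right) \left( \prod_{i=1}^{n} Y_i \right)$.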
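The three product rules can be checked numerically in the same spirit. This sketch uses Python's `math.prod` (available from Python 3.8) and arbitrary illustrative integers.

```python
import math

# Arbitrary illustrative values with n = 3.
X = [2, 3, 4]
Y = [5, 1, 2]
a = 3
n = len(X)

# 1. prod(a * X_i) = a^n * prod(X_i)
rule1 = math.prod(a * x for x in X) == a**n * math.prod(X)

# 2. prod(a) = a^n
rule2 = math.prod(a for _ in range(n)) == a**n

# 3. prod(X_i * Y_i) = prod(X_i) * prod(Y_i)
rule3 = math.prod(x * y for x, y in zip(X, Y)) == math.prod(X) * math.prod(Y)

print(rule1, rule2, rule3)
```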
The sample mean
The sample mean ('arithmetic mean', 'mean' or 'average') is the most common measure of central tendency.
The sample mean of a variable $X$ is denoted as $\bar{X}$.
It is the sum of the observations divided by the number of observations (sample size):
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}. \]
The sample mean
For example, the mean $\bar{X} = \sum_i X_i / n$ of the numbers 1, 4 and 7 is:
\[ \bar{X} = \frac{1 + 4 + 7}{3} = \frac{12}{3} = 4. \]
For the variables in the country example:
The level of democracy: $\bar{X} = 5.3$.
GDP per capita: $\bar{X} = 8.6$ (in thousands of dollars).
Region: the mean is not meaningful, because the values of the variable do not have a meaningful ordering.
Frequency table of the level of democracy
Value of the level
of democracy       Frequency              Cumulative
($X_j$)            ($f_j$)         %      %
 0                   35          22.6      22.6
 1                   12           7.7      30.3
 2                    4           2.6      32.9
 3                    6           3.9      36.8
 4                    5           3.2      40.0
 5                    5           3.2      43.2
 6                   12           7.7      50.9
 7                   13           8.4      59.3
 8                   16          10.3      69.6
 9                   15           9.7      79.3
10                   32          20.6     100
Total               155         100
Mean from a frequency table
If a variable has a small number of distinct values, $\bar{X}$ is easy to calculate from the frequency table.
For example, the level of democracy has just 11 different values, which occur in the sample 35, 12, ..., 32 times each, respectively.
Suppose $X$ has $K$ different values $X_1, X_2, \ldots, X_K$, with corresponding frequencies $f_1, f_2, \ldots, f_K$. Then $\sum_{j=1}^{K} f_j = n$ and:
\[ \bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + \cdots + f_K X_K}{f_1 + \cdots + f_K} = \frac{f_1 X_1 + \cdots + f_K X_K}{n}. \]
In our example, the mean level of democracy (where $K = 11$) is:
\[ \bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + 4 + \cdots + 32} = \frac{0 + 12 + 8 + \cdots + 320}{155} \approx 5.3. \]
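The frequency-table formula for the mean can be sketched directly in Python, using the values and frequencies of the level of democracy from the table above (the variable names are our own):

```python
# Distinct values X_j of the democracy index and their frequencies f_j.
values = list(range(11))  # 0, 1, ..., 10
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]

n = sum(freqs)                                            # sum of f_j
weighted_sum = sum(f * x for f, x in zip(freqs, values))  # sum of f_j * X_j
mean = weighted_sum / n

print(n, weighted_sum, round(mean, 1))
```

The weighted sum is 829, so the mean is 829/155, which rounds to the 5.3 quoted above.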
Why is the mean a good summary of central tendency?
Consider the following small dataset:
                  Deviations:
                  from $\bar{X}$ (= 4)                 from Median (= 3)
$i$    $X_i$    $X_i - \bar{X}$   $(X_i - \bar{X})^2$    $X_i - 3$   $(X_i - 3)^2$
1        1          -3                 9                  -2            4
2        2          -2                 4                  -1            1
3        3          -1                 1                   0            0
4        5          +1                 1                  +2            4
5        9          +5                25                  +6           36
Sum     20           0                40                  +5           45
$\bar{X} = 4$
The sum of deviations from the mean is 0
The mean is 'in the middle' of the observations $X_1, \ldots, X_n$, in the sense that positive and negative values of the deviations $X_i - \bar{X}$ cancel out, when summed over all the observations, that is:
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0. \]
Proof: [The proof uses the definition of $\bar{X}$ and properties of summation introduced earlier. Note that $\bar{X}$ is a constant in the summation, because it has the same value for all $i$.]
\[
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X}) &= \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = \sum_{i=1}^{n} X_i - n\bar{X} \\
&= \sum_{i=1}^{n} X_i - n \, \frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0.
\end{aligned}
\]
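Mean minimises the sum of squared deviations
The smallest possible value of the sum of squared deviations
\[ \sum_{i=1}^{n} (X_i - C)^2 \]
for any constant $C$ is obtained when $C = \bar{X}$.
Proof:
\[
\begin{aligned}
\sum (X_i - C)^2 &= \sum (X_i \underbrace{- \bar{X} + \bar{X}}_{=0} - C)^2 = \sum [(X_i - \bar{X}) + (\bar{X} - C)]^2 \\
&= \sum [(X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - C) + (\bar{X} - C)^2] \\
&= \sum (X_i - \bar{X})^2 + \sum 2(X_i - \bar{X})(\bar{X} - C) + \sum (\bar{X} - C)^2 \\
&= \sum (X_i - \bar{X})^2 + 2(\bar{X} - C) \underbrace{\sum (X_i - \bar{X})}_{=0} + n(\bar{X} - C)^2 \\
&= \sum (X_i - \bar{X})^2 + n(\bar{X} - C)^2 \\
&\geq \sum (X_i - \bar{X})^2
\end{aligned}
\]
since $n(\bar{X} - C)^2 \geq 0$ for any choice of $C$. Equality is obtained only when $C = \bar{X}$, so that $n(\bar{X} - C)^2 = 0$.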
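Both results can be illustrated on the small dataset 1, 2, 3, 5, 9 used earlier. The sketch below (Python, with helper names of our own choosing) checks that the deviations from the mean sum to zero and that the sum of squared deviations about any constant C equals the sum about the mean plus n times the squared distance between the mean and C.

```python
X = [1, 2, 3, 5, 9]
n = len(X)
mean = sum(X) / n  # 4.0


def sum_sq_dev(c):
    """Sum of squared deviations of the observations from the constant c."""
    return sum((x - c) ** 2 for x in X)


# Deviations from the mean cancel out.
dev_sum = sum(x - mean for x in X)

# Decomposition from the proof, checked for a few trial constants C
# (the trial values are arbitrary).
for c in [0, 3, 4, 6.5]:
    lhs = sum_sq_dev(c)
    rhs = sum_sq_dev(mean) + n * (mean - c) ** 2
    print(c, lhs, rhs)
```

For C = 3 (the median) the sum of squares is 45, compared with 40 about the mean, matching the table of deviations.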
The (sample) median
Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the sample values of $X$ ordered from the smallest to the largest, such that:
$X_{(1)}$ is the smallest observed value (the minimum) of $X$
$X_{(n)}$ is the largest observed value (the maximum) of $X$.
The (sample) median
The (sample) median, $q_{50}$, of a variable $X$ is the value that is 'in the middle' of the ordered sample.
If $n$ is odd, $q_{50} = X_{((n+1)/2)}$.
For example, if $n = 3$, $q_{50} = X_{(2)}$, the middle of the ordered positions (1), (2), (3).
If $n$ is even, $q_{50} = [X_{(n/2)} + X_{(n/2+1)}]/2$.
For example, if $n = 4$, $q_{50} = [X_{(2)} + X_{(3)}]/2$, the average of the two middle values among the ordered positions (1), (2), (3), (4).
In the country example, $n = 155$, so $q_{50} = X_{(78)}$. For the level of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the corresponding value of $X$ and the next-higher value).
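The odd and even cases can be sketched as a small Python function (the name `sample_median` is our own; Python's standard library also provides `statistics.median`). Python lists are indexed from 0, so the order statistic in position k of the ordered sample is `ordered[k - 1]`. The democracy sample is rebuilt from its frequency table.

```python
def sample_median(xs):
    """Median of a non-empty sample, following the odd/even rule above."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the order statistic in position (n + 1) / 2.
        return ordered[mid]
    # n even: average of the order statistics in positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Rebuild the 155 democracy scores from the frequencies of values 0, ..., 10.
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
democracy = [value for value, f in enumerate(freqs) for _ in range(f)]

print(len(democracy), sample_median(democracy))
```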
Example: ordered values of level of democracy
Row (r.) and column (.c) together give position 10r + c in the ordered sample; for example, the 78th ordered value, in row (7.) and column (.8), is the median, 6.
       (.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.)           0    0    0    0    0    0    0    0    0
(1.)      0    0    0    0    0    0    0    0    0    0
(2.)      0    0    0    0    0    0    0    0    0    0
(3.)      0    0    0    0    0    0    1    1    1    1
(4.)      1    1    1    1    1    1    1    1    2    2
(5.)      2    2    3    3    3    3    3    3    4    4
(6.)      4    4    4    5    5    5    5    5    6    6
(7.)      6    6    6    6    6    6    6    6    6    6
(8.)      7    7    7    7    7    7    7    7    7    7
(9.)      7    7    7    8    8    8    8    8    8    8
(10.)     8    8    8    8    8    8    8    8    8    9
(11.)     9    9    9    9    9    9    9    9    9    9
(12.)     9    9    9    9   10   10   10   10   10   10
(13.)    10   10   10   10   10   10   10   10   10   10
(14.)    10   10   10   10   10   10   10   10   10   10
(15.)    10   10   10   10   10   10
Median from the frequency table of level of democracy
Value of level
of democracy       Frequency              Cumulative
($X_j$)            ($f_j$)         %      %
 0                   35          22.6      22.6
 1                   12           7.7      30.3
 2                    4           2.6      32.9
 3                    6           3.9      36.8
 4                    5           3.2      40.0
 5                    5           3.2      43.2
 6                   12           7.7      50.9
 7                   13           8.4      59.3
 8                   16          10.3      69.6
 9                   15           9.7      79.3
10                   32          20.6     100
Total               155         100
The mean is sensitive to outliers
For the following sample, the mean and median are both 4:
1 2 4 5 8.
If we add one observation to get the sample:
1 2 4 5 8 100
then the median is now 4.5 and the mean is now 20.
In general the mean is affected much more than the median by outliers, i.e. unusually large or small observations.
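The example can be reproduced with Python's standard `statistics` module (an illustrative choice of tool):

```python
from statistics import mean, median

sample = [1, 2, 4, 5, 8]
with_outlier = sample + [100]

# Original sample: mean 4 and median 4.
print(mean(sample), median(sample))

# With the outlier 100 added: the mean jumps to 20, the median only to 4.5.
print(mean(with_outlier), median(with_outlier))
```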
Skewness, means and medians
The mean, more than the median, is pulled toward the longer tail of the sample distribution.
For a positively skewed distribution, the mean is larger than the median.
For a negatively skewed distribution, the mean is smaller than the median.
For an exactly symmetric distribution, the mean and median are equal.
When summarising variables with skewed distributions, it is useful to report both the mean and the median.
Mean and median: examples
                             Median   Mean
Level of democracy (p. 46)     6       5.3
GDP per capita (p. 50)         4.7     8.6
Blood pressures (p. 53)       73.5    74.2
Examination marks (p. 54)     60.5    59.7
Other measures of central tendency: the mode
The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most often) in the data.
For example, in the country example the mode of region is 1 (Africa) and the mode of the level of democracy is 0.
The mode is not very useful for continuous variables which have many different values, such as GDP per capita in the country example.
A variable can have several modes (i.e. be multimodal). For example, GDP per capita in the example has modes 0.8 and 1.9, both with 5 countries out of the total sample of 155 countries.
The mode is the only measure of central tendency which can be used even when the values of a variable have no ordering, such as for the region variable in the example.
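As a sketch, the mode can be read off a frequency table by picking the value with the largest count. The counts below are those of the region and democracy tables; the short list passed to `statistics.multimode` (Python 3.8 or later) is a made-up example of a sample with two modes.

```python
from statistics import multimode

# Region codes and their counts, from the frequency table of region.
region_counts = {1: 48, 2: 45, 3: 34, 4: 23, 5: 2, 6: 3}
mode_region = max(region_counts, key=region_counts.get)

# Frequencies of democracy index values 0, 1, ..., 10.
democracy_freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
mode_democracy = democracy_freqs.index(max(democracy_freqs))

# A hypothetical raw sample with two modes (multimodal).
modes = multimode([1, 1, 2, 3, 3])

print(mode_region, mode_democracy, modes)
```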
Geometric and harmonic means
The geometric mean $G$ is defined as:
\[ G = \left( \prod_{i=1}^{n} X_i \right)^{1/n} \]
and the harmonic mean $H$ as:
\[ H = \left( \sum_{i=1}^{n} X_i^{-1} / n \right)^{-1} = \frac{n}{\sum_{i=1}^{n} (1/X_i)}. \]
Neither is used very often. Both are examples of the general formula:
\[ g^{-1}\left( \sum_i g(X_i)/n \right) \]
where $g$ is an invertible function and $g^{-1}$ its inverse function. We obtain $\bar{X}$ with $g(x) = x$, $G$ with $g(x) = \log(x)$ and $H$ with $g(x) = 1/x$.
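The general formula translates almost literally into code. In this sketch the function name `generalised_mean` and the sample values are our own illustrative choices; the observations must be positive for the geometric and harmonic means to be defined this way.

```python
import math


def generalised_mean(xs, g, g_inv):
    """Apply g to each value, average, then apply the inverse of g."""
    return g_inv(sum(g(x) for x in xs) / len(xs))


xs = [1, 2, 4]  # illustrative positive values

arithmetic = generalised_mean(xs, lambda x: x, lambda y: y)        # g(x) = x
geometric = generalised_mean(xs, math.log, math.exp)               # g(x) = log(x)
harmonic = generalised_mean(xs, lambda x: 1 / x, lambda y: 1 / y)  # g(x) = 1/x

# Direct definitions, for comparison.
geometric_direct = math.prod(xs) ** (1 / len(xs))
harmonic_direct = len(xs) / sum(1 / x for x in xs)

print(arithmetic, geometric, harmonic)
```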
2.4: Measures of dispersion (variation)
Central tendency is not the whole story. The following two sample distributions have the same mean:
[Figure: two sample distributions with the same mean, one of them drawn in red.]
...but they are clearly not the same. In one (red) the values have more dispersion (variation) than in the other.
A small example again
                            Deviations from $\bar{X}$
$i$    $X_i$    $X_i^2$    $X_i - \bar{X}$   $(X_i - \bar{X})^2$
1        1         1           -3                 9
2        2         4           -2                 4
3        3         9           -1                 1
4        5        25           +1                 1
5        9        81           +5                25
Sum     20       120            0                40
Here $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.
The first measures of dispersion, the sample variance and its square root, the sample standard deviation, are based on $(X_i - \bar{X})^2$, the squared deviations from the mean.
Sample variance
The sample variance of a variable $X$, denoted $S^2$ (or $S_X^2$), is defined as:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \]
Sample standard deviation
The sample standard deviation ('s.d.' for short) of $X$, denoted $S$ (or $S_X$), is the square root of the sample variance, i.e. we have:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}. \]
This is the most commonly used measure of dispersion. The standard deviation is more understandable than the variance because it is expressed in the same units as $X$ (rather than $X^2$).
A rule-of-thumb for interpretation is that, for a symmetric distribution, often:
about 2/3 of the observations are between $\bar{X} - S$ and $\bar{X} + S$
about 95% of the observations are between $\bar{X} - 2S$ and $\bar{X} + 2S$.
Remember that standard deviations (and variances) are never negative and they are zero only if all the observations $X_i$ are the same.
An alternative formula for the variance
The sum of squares in $S^2$ can also be expressed as:
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2. \]
Proof:
\[
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})^2 &= \sum_{i=1}^{n} (X_i^2 - 2X_i\bar{X} + \bar{X}^2) \\
&= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \underbrace{\sum_{i=1}^{n} X_i}_{= n\bar{X}} + \underbrace{\sum_{i=1}^{n} \bar{X}^2}_{= n\bar{X}^2} \\
&= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2.
\end{aligned}
\]
An alternative formula for the variance
The sample variance can therefore also be calculated as:
\[ S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \]
(and the standard deviation $S = \sqrt{S^2}$ again).
This formula is most convenient for calculations by hand.
If using a frequency table, we can also calculate:
\[ S^2 = \frac{\sum_{j=1}^{K} f_j X_j^2 - n\bar{X}^2}{n - 1} \]
(see p. 65 for the analogous formula for the mean).
Sample variance: example of calculations
                            Deviations from $\bar{X}$
$i$    $X_i$    $X_i^2$    $X_i - \bar{X}$   $(X_i - \bar{X})^2$
1        1         1           -3                 9
2        2         4           -2                 4
3        3         9           -1                 1
4        5        25           +1                 1
5        9        81           +5                25
Sum     20       120            0                40
Here $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.
We have:
\[ S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{40}{4} = 10 = \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} = \frac{120 - 5 \times 4^2}{4} \]
and $S = \sqrt{S^2} = \sqrt{10} = 3.16$.
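The same calculation can be sketched in Python, computing the variance both from the definition and from the alternative formula, and comparing with the standard library's `statistics.variance` (which also divides by n - 1):

```python
import math
from statistics import variance

X = [1, 2, 3, 5, 9]
n = len(X)
mean = sum(X) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in X) / (n - 1)

# Alternative formula: (sum of squares - n * mean^2) / (n - 1).
s2_shortcut = (sum(x ** 2 for x in X) - n * mean ** 2) / (n - 1)

s = math.sqrt(s2_definition)

print(s2_definition, s2_shortcut, variance(X), round(s, 2))
```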
Sample quantiles
The median, $q_{50}$, is basically the value which divides the sample into the smallest 50% of observations and the largest 50% of observations.
If we consider other percentage splits, we get other (sample) quantiles (percentiles) $q_c$, for example:
the first quartile, $q_{25}$, is the value which divides the sample into the smallest 25% of observations and the largest 75% of observations
the third quartile, $q_{75}$, for the 75–25 split
the extremes in this spirit are the minimum $X_{(1)}$ (the '0% quantile', so to speak) and maximum $X_{(n)}$ (the '100% quantile').
These are no longer 'in the middle' of the sample, but they are more general measures of location of the sample distribution.
Calculation of sample quantiles
This is how computer software calculates general sample quantiles (or how you can do so by hand, if you ever needed to).
Suppose we need to calculate the cth sample quantile, $q_c$, where 0 < c
Quantile-based measures of dispersion
Two measures based on quantile-type statistics are:
Range: $X_{(n)} - X_{(1)}$ = maximum − minimum.
Interquartile range (IQR): $q_{75} - q_{25}$ = third quartile − first quartile.
The range is clearly extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution.
The IQR focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
Boxplots
A boxplot (box-and-whiskers plot) summarises some key features of a sample distribution using quantiles.
The plot shows:
the line inside the box (the median)
the box: first to third quartiles ($q_{25}$ to $q_{75}$), i.e. the middle 50% of the observations
the whiskers: either to the minimum and maximum, or up to a length of 1.5 times the width of the box, whichever is nearer (the rest of the data, except for outliers)
shown as individual points: observations beyond the ends of the whiskers (regarded as outliers).
A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed distribution.
Boxplot of GDP per capita for 155 countries
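[Figure: boxplot of GDP per capita (vertical axis from 0 to 40). Annotations: Minimum = 0.5; 1st Quartile = 1.7; Median = 4.7; 3rd Quartile = 11.4 (IQR = 11.4 − 1.7 = 9.7); upper whisker ends at 23.7, the largest observation at most 1.5 × IQR = 14.6 above the 3rd Quartile; Maximum = 37.8; observations above the upper whisker are shown as outliers.]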
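The whisker limits in this boxplot can be recomputed from the quoted quartiles. This is a sketch in Python using only the summary values annotated on the plot; the individual GDP observations are not available here, so the upper whisker end itself (23.7) cannot be recomputed, only the limit that it must not exceed.

```python
import math

# Summary values annotated on the GDP per capita boxplot ($000s).
q25, q75 = 1.7, 11.4
minimum, maximum = 0.5, 37.8

iqr = q75 - q25  # width of the box

# Whiskers extend at most 1.5 * IQR beyond the box in each direction.
upper_limit = q75 + 1.5 * iqr
lower_limit = q25 - 1.5 * iqr

# The lower limit is below the minimum, so the lower whisker stops at the minimum.
lower_whisker_end = max(minimum, lower_limit)

# The maximum lies beyond the upper limit, so there are outliers at the top.
has_high_outliers = maximum > upper_limit

print(round(iqr, 2), round(upper_limit, 2), lower_whisker_end, has_high_outliers)
```

Here 1.5 × IQR is 14.55 (quoted as 14.6 on the plot), giving an upper limit of about 25.95; the whisker ends at 23.7 because that is the largest observation not exceeding the limit.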
Summary statistics: examples
                             Median   Mean   s.d.    IQR   Range
Level of democracy (p. 46)     6       5.3    3.9    8      10
GDP per capita (p. 50)         4.7     8.6    9.5    9.7    37.3
Blood pressures (p. 53)       73.5    74.2   11.3   14.5    88
Examination marks (p. 54)     60.5    59.7   17.5   21.3    94
Sample moments
Note: This page is skipped now, but is not marked with. This is because sample moments will be used again, early in Part II of the course.
Let us define, for a variable $X$ and for each $r = 1, 2, \ldots$:
the $r$th sample moment about zero:
\[ m'_r = \frac{\sum_{i=1}^{n} X_i^r}{n} \]
the $r$th central sample moment:
\[ m_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n}. \]
In other words, these are sample averages of the powers $X_i^r$ and $(X_i - \bar{X})^r$.
Clearly, $\bar{X} = m'_1$ and $S^2 = [n/(n-1)]\, m_2 = [n/(n-1)][m'_2 - (m'_1)^2]$.
Moments of powers 3 and 4 are used in two more summary statistics that are described below (as material). These are used much less often than measures of central tendency and dispersion.
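A short sketch of the two kinds of sample moment, checked on the small dataset 1, 2, 3, 5, 9 (whose sample variance was found above to be 10). The function names `raw_moment` and `central_moment` are our own.

```python
import math


def raw_moment(xs, r):
    """r-th sample moment about zero: average of the r-th powers."""
    return sum(x ** r for x in xs) / len(xs)


def central_moment(xs, r):
    """r-th central sample moment: average of r-th powers of deviations from the mean."""
    mean = raw_moment(xs, 1)
    return sum((x - mean) ** r for x in xs) / len(xs)


X = [1, 2, 3, 5, 9]
n = len(X)

m1_raw = raw_moment(X, 1)  # the sample mean
m2_raw = raw_moment(X, 2)  # average of the squares
m2 = central_moment(X, 2)  # average squared deviation (divides by n, not n - 1)

# Both expressions for the sample variance in terms of moments.
s2_from_central = n / (n - 1) * m2
s2_from_raw = n / (n - 1) * (m2_raw - m1_raw ** 2)

print(m1_raw, m2_raw, m2, s2_from_central, s2_from_raw)
```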
Sample skewness
A measure of the skewness of the distribution of a variable $X$ is:
\[ g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\sum_i (X_i - \bar{X})^3 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^{3/2}}. \]
For this measure, $g_1 = 0$ for a symmetric distribution, $g_1 < 0$ for a negatively skewed distribution and $g_1 > 0$ for a positively skewed distribution.
For example, $g_1 = 0.006$ for the (fairly symmetric) blood pressure distribution shown on p. 53, and $g_1 = 1.24$ for the (positively skewed) GDP per capita distribution shown on p. 51.
Sample kurtosis
Kurtosis refers to yet another characteristic of a sample distribution. This has to do with the relative sizes of the 'peak' and tails of the distribution (think about shapes of histograms).
A distribution with high kurtosis (leptokurtic) has a sharp peak and a high proportion of observations in the tails far from the peak.
A distribution with low kurtosis (platykurtic) is 'flat', with no pronounced peak, most of the observations spread evenly around the middle, and weak tails.
A sample measure of kurtosis is:
\[ g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\sum_i (X_i - \bar{X})^4 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^2} - 3. \]
This is $g_2 > 0$ for leptokurtic and $g_2 < 0$ for platykurtic distributions, and $g_2 = 0$ for the normal distribution (introduced later). Some software packages define a measure of kurtosis without the $-3$; the version with the $-3$, as defined here, is then called excess kurtosis.
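Both measures are ratios of central sample moments, so they can be sketched with one helper function. The two tiny datasets below are illustrative only (the second is the small example used earlier); with so few observations the numerical values mean little, but the signs behave as described.

```python
def central_moment(xs, r):
    """r-th central sample moment (divides by n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** r for x in xs) / len(xs)


def skewness(xs):
    """g1: third central moment over the second central moment to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """g2: fourth central moment over the squared second central moment, minus 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]      # symmetric about its mean
right_skewed = [1, 2, 3, 5, 9]   # longer tail to the right

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric), excess_kurtosis(right_skewed))
```

The symmetric sample gives g1 = 0 and the right-skewed sample a positive g1; the evenly spread symmetric sample also gives a negative g2, consistent with a 'flat' distribution.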
2.5: Associations between two variables
So far we have tried to summarise (some aspect of) the sample distribution of one variable at a time.
But we can also look at two (or more) variables together. The key question is then whether some values of one variable tend to occur frequently together with particular values of another, for example high values with high values. This would be an example of an association between the variables. Such associations are central to most interesting research questions, so you will hear much more about them in the future.
Some common methods of descriptive statistics for two-variable associations are introduced here, but only very briefly and mainly through examples.
Different types of two-variable plots and tables
The best way to summarise two variables together depends on whether the variables have few or many possible values.
We illustrate one method for each combination:
Many vs. many: scatterplots (including line plots).
Few vs. many: side-by-side boxplots.
Few vs. few: two-way cross-tabulations.
Scatterplots
A scatterplot shows the values of two continuous variables against each other, plotted as points in a two-dimensional coordinate system.
Example: A plot of data for 164 countries, with:
on the horizontal axis (x-axis): a World Bank measure of control of corruption, where high values indicate low levels of corruption
on the vertical axis (y-axis): GDP per capita.
Interpretation: It appears that virtually all countries with high levels of corruption have relatively low GDP per capita. At lower levels of corruption there is a positive association, where countries with very low levels of corruption also tend to have high GDP per capita.
An example of a scatterplot
[Figure: scatterplot of GDP per capita (y-axis) against the World Bank measure of control of corruption (x-axis).]
Line plots (time series plots)
A common special case of a scatterplot is a line plot (time series plot), where the variable on the x-axis is time. The points are connected in time order by lines, to show how the variable on the y-axis changes over time.
Example: Time series of an index of prices of consumer goods and services in the UK, 1800–2009 (Office for National Statistics; scaled so that the price level in 1974 = 100). This shows the price inflation over that period.
Example of a time series plot: inflation
[Figure: line plot of the UK consumer price index against year, 1800–2009.]
Side-by-side boxplots for comparisons
Boxplots are useful for comparisons of how the distribution of a continuous variable varies across different groups, i.e. across different levels of a discrete variable.
Example: Boxplots of GDP per capita in different regions.
GDP per capita in African countries tends to be very low. There is a handful of countries with somewhat higher GDPs per capita (designated as outliers in the plot).
The median for Asia is not much higher than for Africa. However, the distribution in Asia is heavily skewed to the right, with a tail of countries with very high GDPs per capita.
The median in Europe is high, and the distribution is fairly symmetric.
The boxplots for North America and Oceania are not very useful, because they are based on very few countries (2 and 3, respectively).
Example of side-by-side boxplots
[Figure: boxplot of GDP per capita by region. Horizontal axis: Region (Africa, Asia, Europe, Latin Am., North Am., Oceania); vertical axis: GDP per capita (0 to 40).]
Two-way contingency tables
A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each possible combination of the values of two discrete variables.
Often it also shows percentages within each row or column of the table.
Example: From a survey of 972 private investors [1]:
row variable: age as a discrete, grouped variable (four categories)
column variable: how much importance the person places on short-term gains from his/her investments (four levels).
Interpretation: Look at the row percentages. For example, 17.8% of those aged under 45, but only 5.2% of those 65 and over, think that short-term gains are 'very important'. Among these respondents, the older group seems to be less concerned with quick profits than the younger group.
[1] Lewellen et al. (1977) 'Patterns of investment strategy and behavior among individual investors'. The Journal of Business.
Example of a two-way contingency table
                        Importance of short-term gains
                             Slightly                 Very
Age group      Irrelevant    important   Important    important    Total
Under 45           37           45          38           26         146
                 (25.3)       (30.8)      (26.0)       (17.8)      (100)
45–54             111           77          57           37         282
                 (39.4)       (27.3)      (20.2)       (13.1)      (100)
55–64             153           49          31           20         253
                 (60.5)       (19.4)      (12.3)        (7.9)      (100)
65 and over       193           64          19           15         291
                 (66.3)       (22.0)       (6.5)        (5.2)      (100)
Total             494          235         145           98         972
                 (50.8)       (24.2)      (14.9)       (10.1)      (100)
(Numbers in parentheses are percentages within the rows. For example, 25.3 = (37/146) × 100.)
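The row percentages can be recomputed from the cell counts. This is a minimal Python sketch with variable names of our own; the counts are those in the table above.

```python
age_groups = ["Under 45", "45-54", "55-64", "65 and over"]
levels = ["Irrelevant", "Slightly important", "Important", "Very important"]

# Cell counts: one row per age group, one column per importance level.
counts = [
    [37, 45, 38, 26],
    [111, 77, 57, 37],
    [153, 49, 31, 20],
    [193, 64, 19, 15],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Row percentage = 100 * cell count / row total, to one decimal place.
row_pct = [
    [round(100 * c / total, 1) for c in row]
    for row, total in zip(counts, row_totals)
]

for group, pcts in zip(age_groups, row_pct):
    print(f"{group:<12}", pcts)
```

Column percentages would be obtained in the same way by dividing each cell by its column total instead.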