ST102 MT Section 2

8/10/2019 ST102 MT Section 2


ST102 Elementary Statistical Theory

Descriptive statistics

Dr James Abdey

Department of Statistics
London School of Economics and Political Science

ST102 Elementary Statistical Theory Dr James Abdey MT 2014 Part I: 2. Descriptive statistics 31

Part I: 2. Descriptive statistics

Part I:

1. Introduction.

2. Descriptive statistics.

3. Introduction to probability theory.

4. Random variables.

5. Some common distributions of random variables.

6. Multivariate random variables.

7. Sampling distributions of statistics.


2. Descriptive statistics

2.1: Introduction.

2.2: The sample distribution.

2.3: Measures of central tendency.

2.4: Measures of dispersion.

2.5: Associations between two variables.


2.1: Introduction

Starting point: A collection of numerical data (a sample) has been collected in order to answer some questions.

Statistical analysis may have two broad aims:

1. Descriptive statistics: Summarise the data that were collected, in order to make the data more understandable.

2. Statistical inference: Use the observed data to draw conclusions about some broader population.

Sometimes 1. is the only aim.

Even when 2. is the main aim, 1. is still an essential first step.



Need for descriptive statistics

Data do not just speak for themselves: There are usually simply too many numbers to make sense of just by staring at them.

Descriptive statistics attempt to summarise some key features of the data to make them understandable and easy to communicate.

These summaries may be numerical (tables or individual summary statistics) or graphical.


Example

Consider data for 155 countries on three things, from around 2002:

Region of the country.

Coded as 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = North America and 6 = Oceania.

Level of democracy in the country.

An 11-point scale from 0 (lowest level of democracy) to 10 (highest).

Gross domestic product (GDP) per capita (in $000s).


A dataset

The statistical data in a sample are typically stored in a data matrix.


Units and variables

Rows of the data matrix correspond to different units (subjects).

Here each unit is a country.

The number of units in a dataset is the sample size, typically denoted by the letter n.

Here n = 155 countries.

Columns of the data matrix correspond to variables, i.e. different characteristics of the units.

Here region, level of democracy and GDP per capita are the variables.



Continuous and discrete variables

Different variables may have different properties. These determine what kinds of statistical methods are suitable for the variables.

A continuous variable can, in principle, take any real value within some (continuous) interval.

For example, GDP per capita, which can take any value ≥ 0.

A variable is discrete if it is not continuous, i.e. if it can only take certain (usually integer) values, but not any others.

For example, region, with possible values 1, 2, 3, 4, 5 and 6, and the level of democracy, with possible values 0, 1, 2, . . . , 10.


Discrete variables: number of possible values

Many discrete variables have only a finite number of possible values. In our example, region has 6, and level of democracy has 11 possible values.

The simplest possibility is a binary (dichotomous) variable, with just two values. For example, a person's sex recorded as 1 = female and 2 = male.

A discrete variable can also have an unlimited number of possible values.

For example, the number of visitors to a website in a day could be 0, 1, 2, 3, 4, . . . .


Discrete variables: ordering of the values

In the example, the levels of democracy have a meaningful ordering, from less to more democratic countries.

The numbers assigned to the different levels must also be in this order: a larger number = more democratic.

In contrast, different regions (Africa, Asia, Europe, Latin America, North America and Oceania) do not have such a natural ordering.

The numbers used for the variable Region are just labels for different regions. A different numbering (for example 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 = North America and 4 = Oceania) would be just as acceptable as the one we used.

Some statistical methods are appropriate for variables with both ordered and unordered values, some only in the ordered case.

Unordered categories are nominal data; ordered categories are ordinal data.

Using computers for statistical analysis

For understanding and practice, we make you calculate some descriptive statistics by hand.

However, most real statistical analysis is done with computers, using statistical software packages.

To give you an idea of how they work, in Exercise 1 we ask you to do some descriptive statistics with a package called Minitab.

See a note on the ST102 Moodle site for instructions on how to use Minitab for the exercise.

There are many other statistical packages which do more or less the same thing and which you may encounter in later courses: Stata, SPSS, R, SAS and others.



2.2 The sample distribution

The sample distribution of a variable consists of:

a list of the values of the variable that are observed in the sample

the number of times each value occurs (the counts or frequencies of the observed values).

When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.


Example: observations of region in the sample

3 1 1 4 2 6 3 2 2 2 3 3 1 2 4

1 4 3 1 2 1 1 2 1 5 1 4 2 4 1

1 4 1 3 4 2 3 3 1 4 2 4 1 4 1

1 3 1 6 3 3 1 1 2 3 1 3 4 1 1

4 4 4 3 2 2 2 2 3 2 3 4 2 2 2

1 2 2 2 3 1 1 1 3 3 1 1 2 1 1

1 4 3 2 1 1 2 1 2 3 4 1 1 3 6

2 2 4 4 4 2 6 3 3 2 3 3 1 1 2

2 1 3 1 2 3 3 3 2 1 1 3 3 2 2

2 1 2 1 4 1 2 2 2 1 3 3 4 5 2

4 2 2 1 1


Frequency table of region

Region               Frequency (count)   Relative frequency (%)
(1) Africa                  48                  31.0
(2) Asia                    45                  29.0
(3) Europe                  34                  21.9
(4) Latin America           23                  14.8
(5) North America            2                   1.3
(6) Oceania                  3                   1.9
Total                      155                 100

For example, the relative frequency for Africa is $100 \times (48/155) = 31.0$.

Here % is the percentage of countries in a region, out of the sample of 155 countries. This is a measure of proportion (relative frequency).
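As a rough illustration of how counts and relative frequencies are obtained, the following Python sketch tabulates only the first row of the region observations listed above (15 of the 155 countries). The function name `frequency_table` is our own choice and not part of the course material.

```python
from collections import Counter

# First row of the region observations (15 of the 155 countries).
regions = [3, 1, 1, 4, 2, 6, 3, 2, 2, 2, 3, 3, 1, 2, 4]


def frequency_table(values):
    """Return {value: (count, percentage)} for each observed value, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, round(100 * c / n, 1)) for v, c in sorted(counts.items())}


table = frequency_table(regions)
for value, (count, pct) in table.items():
    print(value, count, pct)

# The same arithmetic reproduces the Africa entry of the full table.
africa_pct = round(100 * 48 / 155, 1)
```

Values that never occur in the data (here region 5) simply do not appear in the table.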


Frequency table of the level of democracy

Democracy index   Frequency      %    Cumulative %
       0              35       22.6       22.6
       1              12        7.7       30.3
       2               4        2.6       32.9
       3               6        3.9       36.8
       4               5        3.2       40.0
       5               5        3.2       43.2
       6              12        7.7       50.9
       7              13        8.4       59.3
       8              16       10.3       69.6
       9              15        9.7       79.3
      10              32       20.6      100
   Total             155      100

Cumulative % for a value of the variable is the sum of the percentages for that value and all lower-numbered values.
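The cumulative column can be sketched in Python from the frequencies in the table. This is only an illustration; the variable names are ours. Note that cumulating the exact counts can differ in the last decimal from the table, whose entries look like sums of the rounded percentages (for the value 6, 79/155 is about 51.0%, while the table shows 50.9).

```python
from itertools import accumulate

values = list(range(11))                             # democracy index 0, 1, ..., 10
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]     # frequencies from the table

n = sum(freqs)                                       # sample size
cum_counts = list(accumulate(freqs))                 # running totals of the counts
cum_pct = [100 * c / n for c in cum_counts]          # cumulative percentages

for value, f, cp in zip(values, freqs, cum_pct):
    print(value, f, round(100 * f / n, 1), round(cp, 1))
```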


Bar charts

A bar chart is the graphical equivalent of the table of frequencies.

[Figure: bar chart of region. Horizontal axis: Region (Africa, Asia, Europe, Latin America, Northern America, Oceania); vertical axis: Count (0 to 50).]


Bar chart of the level of democracy

[Figure: bar chart of the level of democracy. Horizontal axis: Democracy index (0 to 10); vertical axis: Percentage (0.0% to 25.0%).]


Sample distributions of variables with many values

If a variable has many distinct values, listing frequencies of all of them does not work well.

Solution: Group the values into non-overlapping intervals, and produce a table or graph of the frequencies within the intervals.

The most common graph used for this is a histogram.

A histogram is like a bar chart, but without gaps between the bars.

A histogram often uses more bars (intervals of values) than is sensible in a table.

Histograms are usually drawn using statistical software; you can let the software choose the intervals and the number of bars.


Table of frequencies for GDP per capita

GDP per capita ($000s)   Frequency      %
Less than 2.0                49        31.6
2.0 to 4.9                   32        20.6
5.0 to 9.9                   29        18.7
10.0 to 19.9                 21        13.5
20.0 to 29.9                 19        12.3
30.0 or more                  5         3.2
Total                       155       100



2.3 Measures of central tendency

Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a variable.

Next we consider descriptive statistics which summarise one feature of the sample distribution in a single number: summary statistics.

We begin with measures of central tendency. These answer the question: Where is the 'centre' or 'average' of the distribution? We consider:

the mean (arithmetic mean or average)

the median

the mode.


Preliminaries: notation for variables

In formulae, a generic variable is denoted by a single letter.

In these course notes, usually X.

Any other letter (Y, W, etc.) can also be used, as long as it is used consistently.

A letter with a subscript denotes a single observation of a variable.

For example, we use $X_i$ to denote the value of $X$ for unit $i$, where $i$ can take values $1, 2, 3, \ldots, n$, and $n$ is the sample size.

Therefore, the $n$ observations of $X$ in the dataset (the sample) are $X_1, X_2, X_3, \ldots, X_n$. These can also be written as $X_i$, $i = 1, \ldots, n$.


Preliminaries: summation notation

Let $X_1, X_2, \ldots, X_n$ (i.e. $X_i$, $i = 1, \ldots, n$) be a set of $n$ numbers. The sum of the numbers is written as:

\[ \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n. \]

This may also be written as $\sum_i X_i$ or just $\sum X_i$.

Other versions of the same idea:

Infinite sums:

\[ \sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots. \]

Sums of sets of observations other than 1 to $n$, for example:

\[ \sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}. \]


Properties of the summation operator

Here $X_i$ and $Y_i$ ($i = 1, \ldots, n$) are sets of $n$ numbers.

Here $a$ denotes a constant, i.e. a number with the same value for all $i$.

All of the following results follow simply from the properties of addition (if you are still not convinced, try them with $n = 3$).

(1) $\sum_{i=1}^{n} a = n a$.

Proof:

\[ \sum_{i=1}^{n} a = \underbrace{(a + \cdots + a)}_{n \text{ times}} = n a. \]

(2) $\sum_i a X_i = a \sum_i X_i$.

Proof:

\[ \sum_i a X_i = (a X_1 + \cdots + a X_n) = a (X_1 + \cdots + X_n) = a \sum_i X_i. \]



Properties of the summation operator

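(3) $\sum_i (X_i + Y_i) = \sum_i X_i + \sum_i Y_i$.

Proof: Rearranging the elements of the summation, we get:

\[
\begin{aligned}
\sum_i (X_i + Y_i) &= [(X_1 + Y_1) + (X_2 + Y_2) + \cdots + (X_n + Y_n)] \\
&= [(X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n)] \\
&= (X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n) \\
&= \sum_i X_i + \sum_i Y_i.
\end{aligned}
\]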
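Properties (1) to (3) can be checked numerically with n = 3, as the notes suggest. The sketch below uses made-up numbers for the X and Y values and for the constant a; they are purely illustrative.

```python
import math

# Hypothetical numbers, chosen only for illustration (n = 3).
X = [1.0, 4.0, 7.0]
Y = [2.0, 0.5, 3.0]
a = 2.5
n = len(X)

# (1) Summing a constant n times gives n * a.
prop1 = math.isclose(sum(a for _ in range(n)), n * a)

# (2) A constant factor can be taken outside the sum.
prop2 = math.isclose(sum(a * x for x in X), a * sum(X))

# (3) The sum of (X_i + Y_i) equals the sum of X_i plus the sum of Y_i.
prop3 = math.isclose(sum(x + y for x, y in zip(X, Y)), sum(X) + sum(Y))

print(prop1, prop2, prop3)
```

A numerical check of this kind is not a proof, but it is a quick way to catch a misremembered rule.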


Extension: double (triple etc.) summation

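Sometimes sets of numbers may be indexed with two (or even more) subscripts, for example as $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$.

Summation over both indices is written as:

\[
\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} &= \sum_{i=1}^{n} (X_{i1} + \cdots + X_{im}) \\
&= (X_{11} + \cdots + X_{1m}) + (X_{21} + \cdots + X_{2m}) + \cdots + (X_{n1} + \cdots + X_{nm}).
\end{aligned}
\]

The order of summation can be changed, that is:

\[ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = \sum_{j=1}^{m} \sum_{i=1}^{n} X_{ij}. \]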
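The interchange of the order of summation can be illustrated with a small array. The numbers below are hypothetical (n = 2 rows, m = 3 columns); the nested generator expressions mirror the two orders of summation.

```python
# Hypothetical 2 x 3 array of numbers X[i][j] (n = 2, m = 3).
X = [
    [1, 2, 3],
    [4, 5, 6],
]
n, m = len(X), len(X[0])

# Sum over j inside, then over i (row by row).
rows_first = sum(sum(X[i][j] for j in range(m)) for i in range(n))

# Sum over i inside, then over j (column by column).
cols_first = sum(sum(X[i][j] for i in range(n)) for j in range(m))

print(rows_first, cols_first)
```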


Product notation

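The analogous notation for the product of a set of numbers is:

\[ \prod_{i=1}^{n} X_i = X_1 \cdot X_2 \cdots X_n. \]

It follows from the properties of multiplication that, for example:

1. $\prod_{i=1}^{n} a X_i = a^n \left( \prod_{i=1}^{n} X_i \right)$.

2. $\prod_{i=1}^{n} a = a^n$.

3. $\prod_{i=1}^{n} X_i Y_i = \left( \prod_{i=1}^{n} X_i \right) \left( \prod_{i=1}^{n} Y_i \right)$.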
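The three product rules can be checked in the same spirit. This sketch assumes Python 3.8 or later (for `math.prod`) and uses small made-up integers so that the comparisons are exact.

```python
import math

# Hypothetical integers, chosen only for illustration (n = 3).
X = [1, 2, 3]
Y = [4, 5, 6]
a = 2
n = len(X)

# 1. A constant factor inside the product comes out as a**n.
rule1 = math.prod(a * x for x in X) == a**n * math.prod(X)

# 2. The product of n copies of a is a**n.
rule2 = math.prod([a] * n) == a**n

# 3. The product of X_i * Y_i is the product of the two separate products.
rule3 = math.prod(x * y for x, y in zip(X, Y)) == math.prod(X) * math.prod(Y)

print(rule1, rule2, rule3)
```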


The sample mean

The sample mean (arithmetic mean, mean or average) is the most common measure of central tendency.

The sample mean of a variable $X$ is denoted as $\bar{X}$.

It is the sum of the observations divided by the number of observations (sample size):

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}. \]



The sample mean

For example, the mean $\bar{X} = \sum_i X_i / n$ of the numbers 1, 4 and 7 is:

\[ \bar{X} = \frac{1 + 4 + 7}{3} = \frac{12}{3} = 4. \]

For the variables in the country example:

The level of democracy: $\bar{X} = 5.3$.

GDP per capita: $\bar{X} = 8.6$ (in $000s).

Region: the mean is not meaningful, because the values of the variable do not have a meaningful ordering.


Frequency table of the level of democracy

Value of the level
of democracy ($X_j$)   Frequency ($f_j$)      %    Cumulative %
        0                    35             22.6       22.6
        1                    12              7.7       30.3
        2                     4              2.6       32.9
        3                     6              3.9       36.8
        4                     5              3.2       40.0
        5                     5              3.2       43.2
        6                    12              7.7       50.9
        7                    13              8.4       59.3
        8                    16             10.3       69.6
        9                    15              9.7       79.3
       10                    32             20.6      100
    Total                   155            100


Mean from a frequency table

If a variable has a small number of distinct values, $\bar{X}$ is easy to calculate from the frequency table.

For example, the level of democracy has just 11 different values, which occur in the sample 35, 12, . . . , 32 times each, respectively.

Suppose $X$ has $K$ different values $X_1, X_2, \ldots, X_K$, with corresponding frequencies $f_1, f_2, \ldots, f_K$. Then $\sum_{j=1}^{K} f_j = n$ and:

\[ \bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + \cdots + f_K X_K}{f_1 + \cdots + f_K} = \frac{f_1 X_1 + \cdots + f_K X_K}{n}. \]

In our example, the mean level of democracy (where $K = 11$) is:

\[ \bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + 4 + \cdots + 32} = \frac{0 + 12 + 8 + \cdots + 320}{155} \approx 5.3. \]
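The frequency-table calculation of the mean can be sketched in Python using the values and frequencies from the democracy table. The variable names are ours.

```python
values = list(range(11))                             # X_j: democracy index 0, ..., 10
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]     # f_j: frequencies from the table

n = sum(freqs)                                       # sum of the f_j equals the sample size
weighted_sum = sum(f * x for f, x in zip(freqs, values))   # sum of f_j * X_j
mean = weighted_sum / n

print(n, weighted_sum, round(mean, 1))
```

The numerator here is 829, so the mean is 829/155, which rounds to the 5.3 quoted above.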


Why is the mean a good summary of central tendency?

Consider the following small dataset:

                 Deviations from $\bar{X}$ (= 4)         Deviations from the median (= 3)
 i     $X_i$     $X_i - \bar{X}$   $(X_i - \bar{X})^2$     $X_i - 3$    $(X_i - 3)^2$
 1       1            -3                  9                   -2             4
 2       2            -2                  4                   -1             1
 3       3            -1                  1                    0             0
 4       5            +1                  1                   +2             4
 5       9            +5                 25                   +6            36
Sum     20             0                 40                   +5            45

$\bar{X} = 4$



The sum of deviations from the mean is 0

The mean is 'in the middle' of the observations $X_1, \ldots, X_n$, in the sense that positive and negative values of the deviations $X_i - \bar{X}$ cancel out, when summed over all the observations, that is:

\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0. \]

Proof: [The proof uses the definition of $\bar{X}$ and properties of summation introduced earlier. Note that $\bar{X}$ is a constant in the summation, because it has the same value for all $i$.]

\[
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X}) &= \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = \sum_{i=1}^{n} X_i - n \bar{X} \\
&= \sum_{i=1}^{n} X_i - n \, \frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0.
\end{aligned}
\]


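Mean minimises the sum of squared deviations

The smallest possible value of the sum of squared deviations

\[ \sum_{i=1}^{n} (X_i - C)^2 \]

for any constant $C$ is obtained when $C = \bar{X}$.

Proof:

\[
\begin{aligned}
\sum (X_i - C)^2 &= \sum (X_i \underbrace{- \bar{X} + \bar{X}}_{=0} - C)^2 = \sum [(X_i - \bar{X}) + (\bar{X} - C)]^2 \\
&= \sum [(X_i - \bar{X})^2 + 2 (X_i - \bar{X})(\bar{X} - C) + (\bar{X} - C)^2] \\
&= \sum (X_i - \bar{X})^2 + \sum 2 (X_i - \bar{X})(\bar{X} - C) + \sum (\bar{X} - C)^2 \\
&= \sum (X_i - \bar{X})^2 + 2 (\bar{X} - C) \underbrace{\sum (X_i - \bar{X})}_{=0} + n (\bar{X} - C)^2 \\
&= \sum (X_i - \bar{X})^2 + n (\bar{X} - C)^2 \\
&\geq \sum (X_i - \bar{X})^2
\end{aligned}
\]

since $n (\bar{X} - C)^2 \geq 0$ for any choice of $C$. Equality is obtained only when $C = \bar{X}$, so that $n (\bar{X} - C)^2 = 0$.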
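Both results can be verified on the small dataset 1, 2, 3, 5, 9 used earlier. The sketch below (function name `sum_sq_dev` is our own) also checks the identity from the proof, namely that the sum of squared deviations about C equals the sum about the mean plus n times the squared distance between the mean and C.

```python
import math

X = [1, 2, 3, 5, 9]           # the small dataset from the notes
n = len(X)
mean = sum(X) / n             # 4.0


def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from the constant c."""
    return sum((x - c) ** 2 for x in data)


sum_dev_mean = sum(x - mean for x in X)    # deviations from the mean cancel out
ssd_mean = sum_sq_dev(X, mean)             # about the mean
ssd_median = sum_sq_dev(X, 3)              # about the median

# Check the identity from the proof on a grid of candidate constants C.
grid = [c / 10 for c in range(0, 101)]
identity_holds = all(
    math.isclose(sum_sq_dev(X, c), ssd_mean + n * (mean - c) ** 2) for c in grid
)

print(sum_dev_mean, ssd_mean, ssd_median, identity_holds)
```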

The (sample) median

Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the sample values of $X$ ordered from the smallest to the largest, such that:

$X_{(1)}$ is the smallest observed value (the minimum) of $X$

$X_{(n)}$ is the largest observed value (the maximum) of $X$.


The (sample) median

The (sample) median, $q_{50}$, of a variable $X$ is the value that is 'in the middle' of the ordered sample.

If $n$ is odd, $q_{50} = X_{((n+1)/2)}$.

For example, if $n = 3$, $q_{50} = X_{(2)}$, the middle one of the ordered positions (1) (2) (3).

If $n$ is even, $q_{50} = [X_{(n/2)} + X_{(n/2+1)}]/2$.

For example, if $n = 4$, $q_{50} = [X_{(2)} + X_{(3)}]/2$, the average of the two middle positions of (1) (2) (3) (4).

In the country example, $n = 155$, so $q_{50} = X_{(78)}$. For the level of democracy, the median is 6.

From a table of frequencies, the median is the value for which the cumulative percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the corresponding value of $X$ and the next-higher value).
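The two definitions above translate directly into code. The following is a minimal sketch, with function names of our own choosing; the frequency-table version assumes the values are listed in increasing order, as in the democracy table.

```python
def sample_median(values):
    """Median from the ordered sample, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                  # X_((n+1)/2), shifted to 0-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2    # average of X_(n/2) and X_(n/2+1)


def kth_ordered_value(values, freqs, k):
    """k-th smallest observation (k = 1, ..., n) read off the cumulative counts."""
    cumulative = 0
    for value, freq in zip(values, freqs):
        cumulative += freq
        if k <= cumulative:
            return value
    raise ValueError("k exceeds the sample size")


def median_from_frequencies(values, freqs):
    """Median from a frequency table whose values are in increasing order."""
    n = sum(freqs)
    if n % 2 == 1:
        return kth_ordered_value(values, freqs, (n + 1) // 2)
    lower = kth_ordered_value(values, freqs, n // 2)
    upper = kth_ordered_value(values, freqs, n // 2 + 1)
    return (lower + upper) / 2


democracy_values = list(range(11))
democracy_freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
print(median_from_frequencies(democracy_values, democracy_freqs))
```

For the democracy data the 78th ordered observation falls in the group with value 6, in agreement with the median stated above.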



Example: ordered values of level of democracy

       (.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.)           0    0    0    0    0    0    0    0    0
(1.)      0    0    0    0    0    0    0    0    0    0
(2.)      0    0    0    0    0    0    0    0    0    0
(3.)      0    0    0    0    0    0    1    1    1    1
(4.)      1    1    1    1    1    1    1    1    2    2
(5.)      2    2    3    3    3    3    3    3    4    4
(6.)      4    4    4    5    5    5    5    5    6    6
(7.)      6    6    6    6    6    6    6    6    6    6
(8.)      7    7    7    7    7    7    7    7    7    7
(9.)      7    7    7    8    8    8    8    8    8    8
(10.)     8    8    8    8    8    8    8    8    8    9
(11.)     9    9    9    9    9    9    9    9    9    9
(12.)     9    9    9    9   10   10   10   10   10   10
(13.)    10   10   10   10   10   10   10   10   10   10
(14.)    10   10   10   10   10   10   10   10   10   10
(15.)    10   10   10   10   10   10


Median from the frequency table of level of democracy

Value of level
of democracy ($X_j$)   Frequency ($f_j$)      %    Cumulative %
        0                    35             22.6       22.6
        1                    12              7.7       30.3
        2                     4              2.6       32.9
        3                     6              3.9       36.8
        4                     5              3.2       40.0
        5                     5              3.2       43.2
        6                    12              7.7       50.9
        7                    13              8.4       59.3
        8                    16             10.3       69.6
        9                    15              9.7       79.3
       10                    32             20.6      100
    Total                   155            100


The mean is sensitive to outliers

For the following sample, the mean and median are both 4:

1 2 4 5 8.

If we add one observation to get the sample:

1 2 4 5 8 100

then the median is now 4.5 and the mean is now 20.

In general the mean is affected much more than the median by outliers, i.e. unusually large or small observations.


Skewness, means and medians

The mean, more than the median, is pulled toward the longer tail of the sample distribution.

For a positively skewed distribution, the mean is larger than the median.

For a negatively skewed distribution, the mean is smaller than the median.

For an exactly symmetric distribution, the mean and median are equal.

When summarising variables with skewed distributions, it is useful to report both the mean and the median.



Mean and median: examples

                              Median   Mean
Level of democracy (p. 46)      6       5.3
GDP per capita (p. 50)          4.7     8.6
Blood pressures (p. 53)        73.5    74.2
Examination marks (p. 54)      60.5    59.7


Other measures of central tendency: the mode

The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most often) in the data.

For example, in the country example the mode of region is 1 (Africa) and the mode of the level of democracy is 0.

The mode is not very useful for continuous variables which have many different values, such as GDP per capita in the country example.

A variable can have several modes (i.e. be multimodal). For example, GDP per capita in the example has modes 0.8 and 1.9, both with 5 countries out of the total sample of 155 countries.

The mode is the only measure of central tendency which can be used even when the values of a variable have no ordering, such as for the region variable in the example.


Geometric and harmonic means

The geometric mean $G$ is defined as:

\[ G = \left( \prod_{i=1}^{n} X_i \right)^{1/n} \]

and the harmonic mean $H$ as:

\[ H = \left( \sum_{i=1}^{n} X_i^{-1} / n \right)^{-1} = \frac{n}{\sum_{i=1}^{n} (1/X_i)}. \]

Neither is used very often. Both are examples of the general formula:

\[ g^{-1} \left( \sum_i g(X_i) / n \right) \]

where $g$ is an invertible function and $g^{-1}$ its inverse function. We obtain $\bar{X}$ with $g(x) = x$, $G$ with $g(x) = \log(x)$ and $H$ with $g(x) = 1/x$.
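The general formula can be sketched as a single Python function that takes g and its inverse as arguments. The function name `generalised_mean` and the data are our own illustrative choices; the geometric and harmonic versions assume strictly positive values, and `math.prod` requires Python 3.8 or later.

```python
import math


def generalised_mean(values, g, g_inverse):
    """Apply g, take the arithmetic mean of the results, then apply the inverse of g."""
    return g_inverse(sum(g(x) for x in values) / len(values))


# Hypothetical positive data, chosen so that the answers are easy to check by hand.
X = [1.0, 2.0, 4.0]

arithmetic = generalised_mean(X, lambda x: x, lambda y: y)            # g(x) = x
geometric = generalised_mean(X, math.log, math.exp)                   # g(x) = log(x)
harmonic = generalised_mean(X, lambda x: 1 / x, lambda y: 1 / y)      # g(x) = 1/x

print(arithmetic, geometric, harmonic)
```

For these data the product is 8, so the geometric mean is the cube root of 8, and the harmonic mean is 3/(1 + 1/2 + 1/4) = 12/7.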

2.4 Measures of dispersion (variation)

Central tendency is not the whole story. The following two sample distributions have the same mean:

...but they are clearly not the same. In one (red) the values have more dispersion (variation) than in the other.



A small example again

                             Deviations from $\bar{X}$
 i     $X_i$    $X_i^2$    $X_i - \bar{X}$   $(X_i - \bar{X})^2$
 1       1         1            -3                  9
 2       2         4            -2                  4
 3       3         9            -1                  1
 4       5        25            +1                  1
 5       9        81            +5                 25
Sum     20       120             0                 40

Here $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

The first measures of dispersion, the sample variance and its square root, the sample standard deviation, are based on $(X_i - \bar{X})^2$, the squared deviations from the mean.


Sample variance

The sample variance of a variable $X$, denoted $S^2$ (or $S_X^2$), is defined as:

\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \]


Sample standard deviation

The sample standard deviation (s.d. for short) of $X$, denoted $S$ (or $S_X$), is the square root of the sample variance, i.e. we have:

\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}. \]

This is the most commonly used measure of dispersion. The standard deviation is more understandable than the variance because it is expressed in the same units as $X$ (rather than the units of $X^2$).

A rule-of-thumb for interpretation is that for a symmetric distribution often:

about 2/3 of the observations are between $\bar{X} - S$ and $\bar{X} + S$

about 95% of the observations are between $\bar{X} - 2S$ and $\bar{X} + 2S$.

Remember that standard deviations (and variances) are never negative and they are zero only if all the observations $X_i$ are the same.

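An alternative formula for the variance

The sum of squares in $S^2$ can also be expressed as:

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n \bar{X}^2. \]

Proof:

\[
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})^2 &= \sum_{i=1}^{n} (X_i^2 - 2 X_i \bar{X} + \bar{X}^2) \\
&= \sum_{i=1}^{n} X_i^2 - 2 \bar{X} \underbrace{\sum_{i=1}^{n} X_i}_{= n \bar{X}} + \underbrace{\sum_{i=1}^{n} \bar{X}^2}_{= n \bar{X}^2} \\
&= \sum_{i=1}^{n} X_i^2 - n \bar{X}^2.
\end{aligned}
\]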



An alternative formula for the variance

The sample variance can therefore also be calculated as:

\[ S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}{n - 1} \]

(and the standard deviation $S = \sqrt{S^2}$ again).

This formula is most convenient for calculations by hand.

If using a frequency table, we can also calculate:

\[ S^2 = \frac{\sum_{j=1}^{K} f_j X_j^2 - n \bar{X}^2}{n - 1} \]

(see p. 66 for the analogous formula for the mean).


Sample variance: example of calculations

                             Deviations from $\bar{X}$
 i     $X_i$    $X_i^2$    $X_i - \bar{X}$   $(X_i - \bar{X})^2$
 1       1         1            -3                  9
 2       2         4            -2                  4
 3       3         9            -1                  1
 4       5        25            +1                  1
 5       9        81            +5                 25
Sum     20       120             0                 40

Here $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

We have:

\[ S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{40}{4} = 10 = \frac{\sum X_i^2 - n \bar{X}^2}{n - 1} = \frac{120 - 5 \times 4^2}{4} \]

and $S = \sqrt{S^2} = \sqrt{10} = 3.16$.
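The calculation above can be reproduced in Python with both formulas for the variance. As a cross-check only, the sketch also calls `statistics.variance` from the standard library, which uses the same n - 1 divisor.

```python
import math
import statistics

X = [1, 2, 3, 5, 9]           # the small dataset from the notes
n = len(X)
mean = sum(X) / n             # 4.0

# Definition: squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in X) / (n - 1)

# Alternative formula: sum of squares minus n times the squared mean, divided by n - 1.
var_alternative = (sum(x**2 for x in X) - n * mean**2) / (n - 1)

sd = math.sqrt(var_definition)

print(var_definition, var_alternative, statistics.variance(X), round(sd, 2))
```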

Sample quantiles

The median, $q_{50}$, is basically the value which divides the sample into the smallest 50% of observations and the largest 50% of observations.

If we consider other percentage splits, we get other (sample) quantiles (percentiles) $q_c$, for example:

the first quartile, $q_{25}$, is the value which divides the sample into the smallest 25% of observations and the largest 75% of observations

the third quartile, $q_{75}$, for the 75%-25% split

the extremes in this spirit are the minimum $X_{(1)}$ (the '0% quantile', so to speak) and maximum $X_{(n)}$ (the '100% quantile').

These are no longer 'in the middle' of the sample, but they are more general measures of location of the sample distribution.


Calculation of sample quantiles

This is how computer software calculates general sample quantiles (or how you can do so by hand, if you ever needed to).

Suppose we need to calculate the cth sample quantile, q_c, where 0 < c


Quantile-based measures of dispersion

Two measures based on quantile-type statistics are:

Range: $X_{(n)} - X_{(1)}$ = maximum - minimum.

Interquartile range (IQR): $q_{75} - q_{25}$ = third quartile - first quartile.

The range is clearly extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution.

The IQR focuses on the middle 50% of the distribution, so it is largely insensitive to outliers.


Boxplots

A boxplot (box-and-whiskers plot) summarises some key features of a sample distribution using quantiles.

The plot shows:

the line inside the box (the median)

the box: first to third quartiles ($q_{25}$ to $q_{75}$), i.e. the middle 50% of the observations

the whiskers: either to the minimum and maximum, or up to a length of 1.5 times the width of the box, whichever is nearer (the rest of the data, except for outliers)

shown as individual points: observations beyond the ends of the whiskers (regarded as outliers).

A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed distribution.


Boxplot of GDP per capita for 155 countries

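[Figure: boxplot of GDP per capita. Vertical axis: GDP per capita (0 to 40). Annotations on the plot:]

Minimum = 0.5

1st Quartile = 1.7

Median = 4.7

3rd Quartile = 11.4

Maximum = 37.8

(IQR = 11.4 - 1.7 = 9.7)

The upper whisker ends at 23.7, the largest observation at most 1.5 x IQR = 14.6 above the 3rd Quartile.

Observations above the upper whisker are shown as outliers.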
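The arithmetic behind the whiskers can be sketched from the quartiles annotated on the GDP per capita boxplot. The names `lower_limit` and `upper_limit` are our own labels for the points 1.5 x IQR beyond the box; the whiskers themselves end at the most extreme observations inside these limits.

```python
# Summary values annotated on the GDP per capita boxplot (in $000s).
minimum, q25, median, q75, maximum = 0.5, 1.7, 4.7, 11.4, 37.8

iqr = q75 - q25                     # interquartile range, about 9.7
data_range = maximum - minimum      # range, about 37.3

# Whiskers may extend at most 1.5 x IQR beyond the box.
max_whisker_length = 1.5 * iqr
lower_limit = q25 - max_whisker_length
upper_limit = q75 + max_whisker_length

# The lower limit is below the minimum, so the lower whisker stops at the minimum.
# The maximum is above the upper limit, so some observations are plotted as outliers.
has_upper_outliers = maximum > upper_limit

print(round(iqr, 1), round(data_range, 1), round(upper_limit, 2), has_upper_outliers)
```

The value 23.7 quoted for the end of the upper whisker is consistent with this: it lies below the upper limit of about 25.95.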


Summary statistics: examples

                              Median   Mean   s.d.   IQR    Range
Level of democracy (p. 46)      6       5.3    3.9    8     10
GDP per capita (p. 50)          4.7     8.6    9.5    9.7   37.3
Blood pressures (p. 53)        73.5    74.2   11.3   14.5   88
Examination marks (p. 54)      60.5    59.7   17.5   21.3   94



Sample moments

Note: This page is skipped now, but is not marked with. This is because sample moments will be used again, early in Part II of the course.

Let us define, for a variable $X$ and for each $r = 1, 2, \ldots$:

the $r$th sample moment about zero:

\[ m_r = \frac{\sum_{i=1}^{n} X_i^r}{n} \]

the $r$th central sample moment:

\[ m'_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n}. \]

In other words, these are sample averages of the powers $X_i^r$ and $(X_i - \bar{X})^r$.

Clearly, $\bar{X} = m_1$ and $S^2 = [n/(n-1)] \, m'_2 = [n/(n-1)] [m_2 - (m_1)^2]$.

Moments of powers 3 and 4 are used in two more summary statistics that are described below (as material). These are used much less often than measures of central tendency and dispersion.
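The two kinds of sample moment, and their link to the sample mean and variance, can be sketched as follows. The function names are our own; the data are the small dataset 1, 2, 3, 5, 9 used earlier in these notes, for which the sample variance is 10.

```python
def moment_about_zero(data, r):
    """r-th sample moment about zero: the average of the r-th powers."""
    return sum(x**r for x in data) / len(data)


def central_moment(data, r):
    """r-th central sample moment: the average of the r-th powers of deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)


X = [1, 2, 3, 5, 9]
n = len(X)

m1 = moment_about_zero(X, 1)            # the sample mean
m2 = moment_about_zero(X, 2)            # average of the squares
c2 = central_moment(X, 2)               # second central moment
sample_variance = n / (n - 1) * c2      # rescale from divisor n to divisor n - 1

print(m1, m2, c2, sample_variance)
```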


Sample skewness

A measure of the skewness of the distribution of a variable $X$ is:

\[ g_1 = \frac{m'_3}{(m'_2)^{3/2}} = \frac{\sum_i (X_i - \bar{X})^3 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^{3/2}}, \]

where $m'_r$ denotes the $r$th central sample moment.

For this measure, $g_1 = 0$ for a symmetric distribution, $g_1 < 0$ for a negatively skewed distribution and $g_1 > 0$ for a positively skewed distribution.

For example, $g_1 = 0.006$ for the (fairly symmetric) blood pressure distribution shown on p. 53, and $g_1 = 1.24$ for the (positively skewed) GDP per capita distribution shown on p. 51.


Sample kurtosis

Kurtosis refers to yet another characteristic of a sample distribution. This has to do with the relative sizes of the 'peak' and tails of the distribution (think about shapes of histograms).

A distribution with high kurtosis (leptokurtic) has a sharp peak and a high proportion of observations in the tails far from the peak.

A distribution with low kurtosis (platykurtic) is 'flat', with no pronounced peak, most of the observations spread evenly around the middle, and weak tails.

A sample measure of kurtosis is:

\[ g_2 = \frac{m'_4}{(m'_2)^2} - 3 = \frac{\sum_i (X_i - \bar{X})^4 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^2} - 3, \]

where $m'_r$ denotes the $r$th central sample moment.

We have $g_2 > 0$ for leptokurtic and $g_2 < 0$ for platykurtic distributions, and $g_2 = 0$ for the normal distribution (introduced later). Some software packages define a measure of kurtosis without the $-3$; the version with the $-3$, as used here, is known as excess kurtosis.
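The skewness and kurtosis measures can be sketched in Python from the central sample moments (divisor n). The function names are our own, and the two small datasets are illustrative: one is exactly symmetric, the other is the dataset 1, 2, 3, 5, 9 from earlier in these notes.

```python
import math


def central_moment(data, r):
    """r-th central sample moment (divisor n)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)


def skewness_g1(data):
    """Third central moment divided by the second central moment to the power 3/2."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis_g2(data):
    """Fourth central moment divided by the squared second central moment, minus 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]       # exactly symmetric about 3
right_tail = [1, 2, 3, 5, 9]      # longer tail to the right

print(skewness_g1(symmetric), skewness_g1(right_tail), kurtosis_g2(right_tail))
```

For the second dataset the central moments of order 2, 3 and 4 are 8, 18 and 144.8, so the skewness is positive, as expected for a longer right tail.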

2.5 Associations between two variables

So far we have tried to summarise (some aspect of) the sample distribution of one variable at a time.

But we can also look at two (or more) variables together. The key question is then whether some values of one variable tend to occur frequently together with particular values of another, for example high values with high values. This would be an example of an association between the variables. Such associations are central to most interesting research questions, so you will hear much more about them in the future.

Some common methods of descriptive statistics for two-variable associations are introduced here, but only very briefly and mainly through examples.



Different types of two-variable plots and tables

The best way to summarise two variables together depends on whether the variables have few or many possible values.

We illustrate one method for each combination:

Many vs. many: scatterplots (including line plots).

Few vs. many: side-by-side boxplots.

Few vs. few: two-way cross-tabulations.


Scatterplots

A scatterplot shows the values of two continuous variables against each other, plotted as points in a two-dimensional coordinate system.

Example: A plot of data for 164 countries, with:

on the horizontal axis (x-axis): a World Bank measure of 'control of corruption', where high values indicate low levels of corruption

on the vertical axis (y-axis): GDP per capita.

Interpretation: It appears that virtually all countries with high levels of corruption have relatively low GDP per capita. At lower levels of corruption there is a positive association, where countries with very low levels of corruption also tend to have high GDP per capita.


An example of a scatterplot


Line plots (time series plots)

A common special case of a scatterplot is a line plot (time series plot), where the variable on the x-axis is time. The points are connected in time order by lines, to show how the variable on the y-axis changes over time.

Example: Time series of an index of prices of consumer goods and services in the UK, 1800-2009 (Office for National Statistics; scaled so that the price level in 1974 = 100). This shows the price inflation over that period.



Example of a time series plot: inflation


Side-by-side boxplots for comparisons

Boxplots are useful for comparisons of how the distribution of a continuous variable varies across different groups, i.e. across different levels of a discrete variable.

Example: Boxplots of GDP per capita in different regions.

GDP per capita in African countries tends to be very low. There is a handful of countries with somewhat higher GDPs per capita (designated as outliers in the plot).

The median for Asia is not much higher than for Africa. However, the distribution in Asia is heavily skewed to the right, with a tail of countries with very high GDPs per capita.

The median in Europe is high, and the distribution is fairly symmetric.

The boxplots for North America and Oceania are not very useful, because they are based on very few countries (2 and 3, respectively).


Example of side-by-side boxplots

[Figure: boxplot of GDP per capita by region. Horizontal axis: Region (Africa, Asia, Europe, Latin Am., North Am., Oceania); vertical axis: GDP per capita (0 to 40).]


Two-way contingency tables

A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each possible combination of the values of two discrete variables.

Often it also shows percentages within each row or column of the table.

Example: From a survey of 972 private investors [1]:

row variable: age as a discrete, grouped variable (four categories)

column variable: how much importance the person places on short-term gains from his/her investments (four levels).

Interpretation: Look at the row percentages. For example, 17.8% of those aged under 45, but only 5.2% of those 65 and over, think that short-term gains are 'very important'. Among these respondents, the older group seems to be less concerned with 'quick profits' than the younger group.

[1] Lewellen et al. (1977) 'Patterns of investment strategy and behavior among individual investors', The Journal of Business.

Example of a two-way contingency table


                        Importance of short-term gains
                            Slightly                   Very
Age group     Irrelevant   important   Important   important    Total
Under 45          37           45          38          26        146
                (25.3)       (30.8)      (26.0)      (17.8)     (100)
45-54            111           77          57          37        282
                (39.4)       (27.3)      (20.2)      (13.1)     (100)
55-64            153           49          31          20        253
                (60.5)       (19.4)      (12.3)       (7.9)     (100)
65 and over      193           64          19          15        291
                (66.3)       (22.0)       (6.5)       (5.2)     (100)
Total            494          235         145          98        972
                (50.8)       (24.2)      (14.9)      (10.1)     (100)

(Numbers in parentheses are percentages within the rows. For example, 25.3 = (37/146) x 100.)
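The row percentages in a two-way table can be sketched in Python from the counts of the investor survey. The dictionary layout and variable names are our own choices; only the counts come from the table.

```python
# Counts from the table: columns are Irrelevant, Slightly important, Important, Very important.
counts = {
    "Under 45": [37, 45, 38, 26],
    "45-54": [111, 77, 57, 37],
    "55-64": [153, 49, 31, 20],
    "65 and over": [193, 64, 19, 15],
}

# Row percentages: each count divided by its own row total.
row_pct = {
    age: [round(100 * c / sum(row), 1) for c in row]
    for age, row in counts.items()
}

# Column totals and the overall sample size (the 'Total' row of the table).
col_totals = [sum(col) for col in zip(*counts.values())]
n = sum(col_totals)

for age, pcts in row_pct.items():
    print(age, sum(counts[age]), pcts)
print("Total", n, [round(100 * t / n, 1) for t in col_totals])
```

Dividing by the column totals instead would give column percentages, which answer a different question (the age make-up of each importance level).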
