CRICOS Provider Code 00301J Library Introduction to Statistics WORKSHOP 1: DESCRIPTIVE STATISTICS PRESENTED BY CLAIRE HULCUP

Workshop Summary

This workshop looks at the use of descriptive statistics for univariate analysis. In particular, it will cover:

• distinguishing between different types of variables;
• displaying and analysing categorical variables; and
• displaying and analysing continuous variables.

For more information on any of these topics, or for additional numeracy support, please contact Claire Hulcup at The Learning Centre by:

Phone: 9266 2290
Email: [email protected]

For an online version of these resources, visit: http://studyskills.curtin.edu.au/study-resources/workshop-handouts/introduction-to-statistics/

Descriptive Statistics for Univariate Analysis

When we use statistics to summarise or describe only the data we have (i.e. our sample), as opposed to drawing conclusions about larger populations, we refer to them as descriptive statistics.

When our descriptive statistics involve summarising just one variable, summarising multiple variables separately, or comparing the same variables between different populations, we refer to this as univariate analysis.

This session looks at using descriptive statistics for univariate analysis, while subsequent sessions focus on inferential statistics, bivariate analysis and testing for reliability.

Types of Data

Choosing the right descriptive statistics depends on the type of data (and hence the type of variable), so it’s important to be able to distinguish between these. Broadly speaking, your data will be one of the following types:

• Categorical data: data which is grouped into categories, such as data for gender or smoking status. Categorical data can be further classified as:

  • Nominal data: when the categories do not have an order (e.g. marital status)
  • Ordinal data: when the categories do have an order (e.g. satisfaction level)
  • Binary data: when there are only two categories

• Continuous data: data which is measured on a continuum, and which can take on a large number of possible values, such as weights and distances.

Note that discrete data is a special type of data which can be treated as either of the above types, depending on how many values are possible. It is measured as counts or numbers of events, e.g. number of children in a family.

Practise: Types of Data

• Have a go at the problems in Question 1 of the worksheet accompanying this workshop

Displaying Categorical Data: Tables

Categorical data can be displayed in a table, or in particular a frequency distribution table. This is a table which displays the various categories for a variable, along with the corresponding frequencies (i.e. how often each category occurs in the data) and often associated percentages.

For example, the following is a frequency table showing the frequency of each category of marital status in a sample of 80 people, along with the corresponding percentages:

Marital status   Frequency   Percent   Cumulative Percent
Single                  44      55.0                 55.0
Married                 29      36.3                 91.3
Other                    7       8.8                100.0
Total                   80     100.0
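As a rough illustration of how such a table is built, the following Python sketch counts each category and derives the percentage and cumulative percentage. The raw list of responses is an assumption reconstructed from the counts in the table (the original responses are not given), and the variable names are illustrative only.

```python
from collections import Counter

# Assumed raw data: rebuilt from the counts in the table, not the original responses
marital_status = ["Single"] * 44 + ["Married"] * 29 + ["Other"] * 7

counts = Counter(marital_status)   # frequency of each category
n = sum(counts.values())           # sample size (80)

rows = []
cumulative = 0
for category in ["Single", "Married", "Other"]:
    frequency = counts[category]
    cumulative += frequency
    percent = 100 * frequency / n
    cumulative_percent = 100 * cumulative / n
    rows.append((category, frequency, percent, cumulative_percent))

for row in rows:
    print(row)
```

The values are left unrounded here (e.g. 36.25 for Married); the table reports them to one decimal place.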

Displaying Categorical Data: Column Graphs

Another way to display categorical data is using a column graph or bar chart.

Column graphs and bar charts use rectangles of equal width (which do not touch) to represent each data category. The rectangles are drawn vertically for column graphs and horizontally for bar charts, and in each case the height (or length) of the rectangles allows the quantities for each category, typically counts or percentages, to be compared. For example:

Displaying Categorical Data: Pie Charts

Categorical data can also be displayed in a pie chart, in order to show how the sample is divided up between the various categories.

For example, our data for marital status can be displayed in a pie chart as follows:

Analysing Categorical Data

Descriptive statistics for categorical data include frequencies, percentages, fractions and/or relative frequencies obtained from the variable’s frequency distribution table.

For example, we can describe the distribution of our marital status variable in the sample by selecting some key values from the frequency distribution table, i.e.:

• 91.3% of the sample are either single or married
• 7 people in the sample identified their marital status as ‘other’
• The relative frequency of people in the sample who are married is 0.36 (i.e. 29 ÷ 80)
• Over ½ of the people in the sample are single

Marital status   Frequency   Percent   Cumulative Percent
Single                  44      55.0                 55.0
Married                 29      36.3                 91.3
Other                    7       8.8                100.0
Total                   80     100.0

Practise: Analysing Categorical Data

• Have a go at the problems in Question 2 of the worksheet accompanying this workshop

Displaying Continuous Data: Tables

Continuous data can also be displayed in a frequency distribution table, but the data first needs to be grouped into bins (also known as class intervals). These should be chosen based on the data, but they should generally be of the same size and ideally there should not be too many.

For example, the following is a frequency distribution table showing the frequency of ages of a sample of 80 people, once they have been sorted into bins:

Age (Binned)   Frequency   Percent   Cumulative Percent
20 - 24               17      21.3                 21.3
25 - 29               23      28.7                 50.0
30 - 34               19      23.8                 73.8
35 - 39               17      21.3                 95.0
40 - 44                4       5.0                100.0
Total                 80     100.0
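The binning step can be sketched in Python as follows. The ages used here are hypothetical values invented for illustration (the workshop's 80 individual ages are not given), and the sketch assumes ages are recorded in whole years with five-year bins starting at 20, as in the table.

```python
from collections import Counter

# Hypothetical ages in whole years, invented for illustration only
ages = [21, 24, 26, 27, 29, 31, 33, 35, 38, 42]

bin_width = 5          # each bin covers five whole years
first_bin_start = 20   # the lowest bin is 20 - 24

def bin_label(age):
    """Return the label of the bin (class interval) that contains this age."""
    start = first_bin_start + ((age - first_bin_start) // bin_width) * bin_width
    return f"{start} - {start + bin_width - 1}"

# Count how many ages fall into each bin
bin_counts = Counter(bin_label(age) for age in ages)

for label in sorted(bin_counts):
    print(label, bin_counts[label])
```

Once the ages have been replaced by bin labels, the frequencies and percentages are calculated in the same way as for a categorical variable.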

Displaying Continuous Data: Histograms

One of the graphs most commonly used to display continuous data (typically once it has been sorted into bins) is a histogram. These have the following properties:

• as the variable is continuous the bars touch (unless an interval has zero frequency);
• the bars are vertical;
• the vertical axis should start at zero and should have no breaks; and
• the area of each bar represents the frequency of the corresponding bin (however, since the area of a bar is calculated by multiplying the height by how many standard bins it is in width, the fact that most histograms have equal bin width means that the frequency is usually also equal to the height).

For example, a histogram for our Age variable is as follows:

Displaying Continuous Data: Box plots

Another way of displaying continuous data is a box plot, which is a diagram showing the way the data for a variable is distributed. For example, a box plot for our Age variable is as follows:

Median = the middle value when the data set is arranged from smallest to largest

Lower quartile = the middle value between the smallest value and the median

Upper quartile = the middle value between the largest value and the median

Smallest value (except outliers)

Largest value (except outliers)

Analysing Continuous Data

Continuous data is generally analysed using two types of descriptive statistics:

• measures of central tendency, which summarise the data set by finding the average, central or typical member (for example mean, median and mode); and

• measures of dispersion, which summarise the data set by finding out how widely it is spread or dispersed (for example range, interquartile range, variance and standard deviation).

It is important not only to understand how to calculate and interpret the various measures of central tendency and dispersion, but to know when it is appropriate to use the various types…

Measures of Central Tendency

The most common measure of central tendency is the mean, which is the arithmetic average of a sample. It is denoted by $\bar{x}$, and hence for a sample of size $n$ we have:

$$\bar{x} = \frac{\sum x}{n}$$

For example, the mean of the following sample of ten ages:

19, 25, 28, 28, 23, 15, 28, 22, 24, 21

can be calculated as:

$$\bar{x} = \frac{19 + 25 + 28 + 28 + 23 + 15 + 28 + 22 + 24 + 21}{10} = 23.3$$
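The same calculation can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative.

```python
import statistics

ages = [19, 25, 28, 28, 23, 15, 28, 22, 24, 21]

# Mean by hand: sum of the values divided by the sample size
mean_by_hand = sum(ages) / len(ages)

# The standard library gives the same result
mean_from_library = statistics.mean(ages)

print(mean_by_hand, mean_from_library)
```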

Measures of Central Tendency

While the mean is typically used as the measure of central tendency for continuous data, it may not be appropriate if the data set is badly skewed, contains outliers or if the variable is censored (not fully observed). In these situations the median is more suitable.

The median is the midpoint of the distribution, which is calculated for a sample by sorting the data from smallest to largest and then finding the middle value (or the mean of the middle two values if there are an even number of observations).

For example, putting our sample of ten ages in order gives:

15, 19, 21, 22, 23, 24, 25, 28, 28, 28

and the median is then the mean of the middle two:

$$\frac{23 + 24}{2} = 23.5$$
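A short Python sketch of the same steps (sort, then take the middle value, or the mean of the middle two for an even-sized sample) is shown below; the variable names are illustrative.

```python
import statistics

ages = [19, 25, 28, 28, 23, 15, 28, 22, 24, 21]

ordered = sorted(ages)
n = len(ordered)

if n % 2 == 0:
    # Even number of observations: mean of the middle two values
    median_by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd number of observations: the single middle value
    median_by_hand = ordered[n // 2]

print(median_by_hand, statistics.median(ages))
```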

Measures of Central Tendency

Finally, the mode is the most frequently occurring value (or values) in the sample (if there are two or more values that are equally common we quote them all, rather than finding the average).

Note that the mode may not necessarily be anywhere near the middle of the data set, and hence is not necessarily ‘central’, but it is useful when the most common value is of interest.

For example, the mode of our sample of ten ages:

19, 25, 28, 28, 23, 15, 28, 22, 24, 21

is 28.

Note also that the mode can be, and is most often, used for categorical data.
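As a sketch, the mode can be found in Python by counting how often each value occurs. The `statistics.multimode` function (Python 3.8 or later) returns every value that shares the highest count, which matches the convention of quoting all modes rather than averaging them. The categorical list below is a hypothetical example, not workshop data.

```python
import statistics
from collections import Counter

ages = [19, 25, 28, 28, 23, 15, 28, 22, 24, 21]

# Count occurrences of each age, then list all values with the highest count
age_counts = Counter(ages)
modes = statistics.multimode(ages)

# The same function works for categorical data (hypothetical example values)
statuses = ["Single", "Married", "Single", "Other", "Single"]
status_modes = statistics.multimode(statuses)

print(modes, status_modes)
```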

Practise: Measures of Central Tendency

• Have a go at Questions 3a, 3b and 3c of the worksheet accompanying this workshop

Measures of Dispersion

The simplest measure of dispersion is the range; this is simply the difference between the smallest and the largest value in the sample.

For example, the range of our sample of ten ages is 13 (i.e. 28 – 15).

While the range is easy to calculate, it is usually of limited use as it takes into account just two values in the data set.

A measure of dispersion that takes into account more, if still not all, values in a data set is the calculation of percentiles. These measure position from the beginning of a data set, and can be used to measure the relative standing of a particular observation.

For example the 10th percentile of our sample of ages is 17, indicating that only 10% of the sample is less than 17.
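The range and the 10th percentile quoted above can be reproduced with the Python sketch below. Note that software packages use several different percentile conventions, and the workshop does not state which one it uses; the rule in `percentile` is an assumption, chosen because it gives the value of 17 quoted here. The function name is illustrative.

```python
import math

ages = sorted([19, 25, 28, 28, 23, 15, 28, 22, 24, 21])

# Range: largest value minus smallest value
sample_range = ages[-1] - ages[0]

def percentile(sorted_values, p):
    """p-th percentile (0 < p < 100) under one assumed convention.

    If n * p / 100 is a whole number k, average the k-th and (k+1)-th
    ordered values; otherwise round up and take that ordered value.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_values)
    position = n * p / 100
    if position.is_integer():
        k = int(position)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    return sorted_values[math.ceil(position) - 1]

tenth_percentile = percentile(ages, 10)
print(sample_range, tenth_percentile)
```

Other conventions (for example, linear interpolation between neighbouring values) give slightly different answers for small samples, so results from statistical software may not match exactly.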

Measures of Dispersion

Quartiles are a specific type of percentile, which divide the sample into quarters, i.e. the:

• 25th percentile (first or lower quartile);
• 50th percentile (median); and
• 75th percentile (third or upper quartile).

For example, the quartiles for our ten ages (which are in order below) are:

15, 19, 21, 22, 23, 24, 25, 28, 28, 28

First quartile = 21 (the middle value of the lower five ages)
Median = 23.5
Third quartile = 28 (the middle value of the upper five ages)

Once you have determined the quartiles you can determine another measure of dispersion: the interquartile range. This is the difference between the first and the third quartiles, i.e. 7 in this case (28 – 21).
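A minimal Python sketch of this calculation is given below. It assumes the "median of each half" rule (lower quartile = middle value between the smallest value and the median, and similarly for the upper quartile), which reproduces the interquartile range of 7 quoted above; statistical software often uses interpolation instead and may give slightly different quartiles.

```python
import statistics

ages = sorted([19, 25, 28, 28, 23, 15, 28, 22, 24, 21])
n = len(ages)
half = n // 2

# Split the ordered data into lower and upper halves
# (for an odd sample size the median itself is left out of both halves)
lower_half = ages[:half]
upper_half = ages[half:] if n % 2 == 0 else ages[half + 1:]

first_quartile = statistics.median(lower_half)
median = statistics.median(ages)
third_quartile = statistics.median(upper_half)

interquartile_range = third_quartile - first_quartile
print(first_quartile, median, third_quartile, interquartile_range)
```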

Practise: Measures of Dispersion

• Have a go at Questions 3d and 3e of the worksheet accompanying this workshop

Measures of Dispersion

The interquartile range is generally quoted in conjunction with the median in situations where the latter is more appropriate than the mean. In all other situations the mean and standard deviation are usually used to summarise a sample.

The standard deviation takes into account all of the values in a sample. It is calculated by finding the deviation of each value from the mean, squaring each result, adding the squares together and dividing by one less than the sample size. This gives yet another measure of dispersion, known as the variance. Taking the positive square root of the variance then gives the standard deviation, which is more appropriate as it is measured on the original scale of the data.

The formula for the standard deviation ($s$) of a sample of size $n$ is:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Measures of Dispersion

While standard deviation, along with other measures of dispersion and central tendency, can be calculated using various software, it is also good to understand how to determine it manually. For example, to calculate the standard deviation of our ten ages:

19, 25, 28, 28, 23, 15, 28, 22, 24, 21

we first need to calculate the mean of the sample ($\bar{x}$), which we know in this case is 23.3.

Next, calculating the deviation of each age from this mean gives:

−4.3, 1.7, 4.7, 4.7, −0.3, −8.3, 4.7, −1.3, 0.7, −2.3

and squaring these gives:

18.49, 2.89, 22.09, 22.09, 0.09, 68.89, 22.09, 1.69, 0.49, 5.29

Now adding these ten values gives a total of 164.1, and dividing through by 9 (i.e. 10 – 1) gives the variance of 18.23. Finally, taking the square root gives the sample standard deviation of 4.27.
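The manual steps above can be mirrored in Python as a check. This is a sketch only; the variable names are illustrative, and the final line compares the result with the standard library's sample standard deviation, which also divides by one less than the sample size.

```python
import math
import statistics

ages = [19, 25, 28, 28, 23, 15, 28, 22, 24, 21]
n = len(ages)

mean = sum(ages) / n                               # 23.3
deviations = [x - mean for x in ages]              # deviation of each age from the mean
squared_deviations = [d ** 2 for d in deviations]  # squared deviations

sum_of_squares = sum(squared_deviations)           # approximately 164.1
variance = sum_of_squares / (n - 1)                # divide by one less than the sample size
standard_deviation = math.sqrt(variance)           # back on the original scale (years)

print(round(variance, 2), round(standard_deviation, 2))
print(statistics.stdev(ages))                      # library value for comparison
```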

Practise: Measures of Dispersion

• Have a go at Questions 3f and 3g of the worksheet accompanying this workshop

Measures of Dispersion

A final measure of dispersion is the coefficient of variation (CV). This measures the amount of variation in a sample relative to its mean, and can be used to compare the spread of variables measured on different scales.

The coefficient of variation is calculated by dividing the standard deviation of a sample by its mean, and then multiplying by 100.

For example, the coefficient of variation for our ten ages:

19, 25, 28, 28, 23, 15, 28, 22, 24, 21

is:

$$\frac{4.27}{23.3} \times 100 = 18.33$$
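As a quick check, the coefficient of variation can be computed in Python from the sample standard deviation and mean. This is a sketch using the standard library; the variable names are illustrative.

```python
import statistics

ages = [19, 25, 28, 28, 23, 15, 28, 22, 24, 21]

mean = statistics.mean(ages)                  # 23.3
standard_deviation = statistics.stdev(ages)   # sample standard deviation (divides by n - 1)

# Standard deviation as a percentage of the mean
coefficient_of_variation = standard_deviation / mean * 100

print(round(coefficient_of_variation, 2))
```

Because the result is a percentage of the mean, it has no units, which is what allows variables measured on different scales to be compared.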

Practise: Measures of Dispersion

• Have a go at Question 3h of the worksheet accompanying this workshop

Summary: Descriptive Statistics for Univariate Analysis