@mlemanchester

My Learning Essentials

The University of Manchester Library

Adding up

Introduction to statistics

CHEAT SHEET

objects/mle/statistics-www.escholar.manchester.ac.uk/learning

Introduction

You don’t need to be a mathematician to have a grip on the basic principles of statistics. We all encounter stats on a daily basis throughout our personal, academic and professional lives, and having an understanding of the basics will help you to use data in a more meaningful way.

This resource will give you a basic introduction to statistical terminology, charts and data types. It will also give you advice on how to organise your data to make it easier to analyse and present.

Levels of measurement

The term “levels of measurement” refers to the type of data we are collecting or analysing. You will be able to interpret and analyse data more effectively by understanding the measure being used.

There are three levels of measurement: nominal, ordinal and scale.

Nominal

Qualitative data with discrete categories that have no order. Nationality, gender and location are examples of nominal measurements.

Ordinal

Qualitative data with discrete categories that have an order. Examples include clothing sizes and 5-star rating systems. If you rated a film using a five star system, a score of 5 stars is better than 3 stars, and that is better than 1 star. But the difference between 1 and 3 may not be the same as that between 3 and 5. The values simply express an order.

Scale

Quantitative data in which numeric values lie on a scale. There are two types of scale measurement:

• Interval - variables in which the distance between two values is meaningful but the ratio is not. For example, temperature in degrees Celsius is an interval measurement: you wouldn't say that one day is 1.5 times hotter than another, but you would say that it is 5 degrees hotter.

• Ratio - variables in which both the difference and the ratio between two values are meaningful. Examples of ratio measurements include distance and duration.
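The difference between interval and ratio measurements can be checked with a little arithmetic. The sketch below uses made-up temperatures and distances (they are illustrative assumptions, not data from this resource) to show that a temperature ratio changes when the unit changes, while a distance ratio does not.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


# Hypothetical temperatures on two days (interval measurement).
cool_c, warm_c = 10.0, 20.0
cool_f = celsius_to_fahrenheit(cool_c)  # 50.0
warm_f = celsius_to_fahrenheit(warm_c)  # 68.0

# The ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_c = warm_c / cool_c
ratio_f = warm_f / cool_f

# Hypothetical commute distances (ratio measurement).
KM_PER_MILE = 1.609344
short_km, long_km = 5.0, 10.0
ratio_km = long_km / short_km
ratio_miles = (long_km / KM_PER_MILE) / (short_km / KM_PER_MILE)

print(f"Temperature ratio: {ratio_c:.2f} in Celsius, {ratio_f:.2f} in Fahrenheit")
print(f"Distance ratio: {ratio_km:.2f} in km, {ratio_miles:.2f} in miles")
```

The two temperature ratios disagree, whereas the two distance ratios agree, which is why "twice as far" is a sensible statement and "twice as hot" is not.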

Legitimate measures

There are a number of descriptive statistics that are used to describe data. These measures fall into two main categories: averages and spread.

Which measure is used will depend on the level of measurement of the data. The summaries in this section show which measures are appropriate to use with the different levels of measurement.

Averages

Mode

Definition: The value that occurs most often in the set.

Used for: Mode is valid for all types of data, although it may not be meaningful for scale data in small data sets.

Median

Definition: The value in the middle of the set.

Used for: Median can be used for scale or ordinal data.

Mean

Definition: Mean is the arithmetic average and considers the whole set.

Used for: Mean can be used for scale data only.
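As a minimal sketch, the three averages can be calculated with Python's standard `statistics` module. The data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical scale data: monthly commute costs in pounds.
commute_costs = [40, 55, 55, 60, 75, 90, 120]

cost_mode = mode(commute_costs)      # the value that occurs most often
cost_median = median(commute_costs)  # the middle value of the sorted data
cost_mean = mean(commute_costs)      # the arithmetic average of the whole set

# Hypothetical nominal data: only the mode is a legitimate average here.
transport = ["car", "bus", "car", "train", "bicycle", "car"]
transport_mode = mode(transport)

print(cost_mode, cost_median, round(cost_mean, 2), transport_mode)
```

Asking for the mean or median of `transport` would not make sense, because nominal categories have no order and no numeric value.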

Spread

Range

Definition: The gap between the highest and lowest values.

Used for: Range can be used for scale or ordinal data.

Standard deviation

Definition: A calculated number to describe the scatter of data about the mean.

Used for: Standard deviation can be used for scale data only.
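The two measures of spread can be sketched in the same way. The data is hypothetical. This resource does not say whether the sample or the population standard deviation is intended, so the example computes both and labels them; treating the data as a sample is an assumption.

```python
from statistics import pstdev, stdev

# Hypothetical scale data: monthly commute costs in pounds.
commute_costs = [40, 55, 55, 60, 75, 90, 120]

# Range: the gap between the highest and lowest values.
value_range = max(commute_costs) - min(commute_costs)

# Standard deviation: the scatter of the data about the mean.
sample_sd = stdev(commute_costs)       # treats the data as a sample
population_sd = pstdev(commute_costs)  # treats the data as the whole population

print(f"Range: {value_range}")
print(f"Sample SD: {sample_sd:.2f}, population SD: {population_sd:.2f}")
```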

Charts and basic analysis

Charts and frequency tables are used to represent data in a format that is clear and easy to understand.

This section will give you an overview of the different types of charts and the types of data they are used to represent. You will also have a chance to explore frequency tables and the data they can be used to show.

Bar chart

In this example chart we can see several bars, each representing a group of monthly commute costs reported by our survey respondents. The height of each bar represents the number of responses in each group.

Key characteristics

The categories on a bar chart can appear in any order. One axis details the categories being measured; the other axis displays the measurement scale.

What is it?

This is a bar chart. This type of chart can be displayed using either vertical or horizontal bars. Bar charts show specific categories against a set scale. This allows you to compare the values of the different categories against the scale.

Use for

Direct comparison of the differences between categories of ordinal or nominal variables. A bar chart can allow easier comparisons between categories than a pie chart.

Do not use for

This type of chart is not the most suitable for scale variables. A histogram would be better; a histogram has a scale along the bottom axis representing the intervals in order. If a histogram is not available, a bar chart (with the bars pushed together) can be an acceptable alternative, provided that all of the intervals have a value. Bar charts do not show empty bins but histograms do.

Pie chart

In this example chart, the segments of the circle represent the people who hold a valid driving licence and the people who do not. The size of each segment represents the proportion of people in each category.

Key characteristics

Pie charts show data categories relative to each other. Several variations of pie chart exist including doughnut charts and exploded pie charts.

What is it?

This is a pie chart. In this type of chart each category of data is represented as a segment of the circle. The complete circle represents 100% of the data; each segment represents a relative proportion of data.

Use for

Comparing the differences between categories of nominal or ordinal variables. A pie chart can also be more visually attractive and use less space than a bar chart.

Do not use for

This type of chart is not suitable for scale variables or those with a large number of categories.

Histogram

This example chart shows the number of responses for each duration of commute time, from 0.5 to 2.5 hours.

Key characteristics

Histograms are used to represent scale variables. The height of the bars is a representation of the distribution of data and the scale along the bottom is continuous.

What is it?

This is a histogram. Although it looks similar to a bar chart, the two present different kinds of data. Histograms allow you to see the distribution of data on a scale measurement.

Use for

Use to see the spread or shape of a scale variable.

Do not use for

This type of chart is not suitable for nominal or ordinal variables. If there are a lot of narrow bars, you may be able to group the responses by “binning” them.
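Binning can be illustrated without any charting library. The sketch below is one possible approach, not the method of any particular analysis tool: it counts hypothetical commute times into equal-width bins and keeps empty bins, as a histogram would. The function name `bin_counts` and the data are assumptions made for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins, keeping empty bins.

    Each bin is the half-open interval [low, low + width), except the last,
    which also includes its upper edge. Values outside the range are ignored.
    """
    counts = [0] * n_bins
    end = start + width * n_bins
    for v in values:
        if v == end:
            index = n_bins - 1  # the top edge belongs to the last bin
        else:
            index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical commute times in hours, from 0.5 to 2.5.
commute_hours = [0.5, 0.75, 0.75, 1.0, 1.0, 1.25, 2.25, 2.5]

# Four bins of half an hour each: 0.5-1.0, 1.0-1.5, 1.5-2.0, 2.0-2.5.
counts = bin_counts(commute_hours, start=0.5, width=0.5, n_bins=4)
for i, count in enumerate(counts):
    low = 0.5 + i * 0.5
    print(f"{low:.1f}-{low + 0.5:.1f} h: {'#' * count}")
```

The third bin has no responses but still appears in the output, which is the behaviour that distinguishes a histogram from a bar chart.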

Box plot

This is a box plot. This type of chart is used to show the distribution of a data set. This example shows the distance that respondents commute to work.

Key characteristics

Box plots have a very distinctive look: a box with a line coming out of the top, a line coming out of the bottom, and a line across the middle.

What is it?

A box plot is used to show the distribution of an ordinal or scale data set. Box plots tell us several things about a set of data; each part of the plot is described later in this section.

Use for

Use for displaying the spread of the middle half of a variable, and its maximum and minimum values. You can have multiple box plots on the same scale, one for each category.

Do not use for

This type of chart is not suitable for nominal data.

The parts of a box plot, from top to bottom:

Highest value

This line shows the highest value in the data set not including any outliers, in this case the longest commute.

Upper quartile

The value of the 75th percentile, or the median of the upper half of the data.

Interquartile range

The area of the box, between the upper and lower quartiles, where the middle half of the data lies.

Median

The value of the 50th percentile, or median, of the data.

Lower quartile

The value of the 25th percentile, or the median of the lower half of the data.

Lowest value

This line shows the lowest value in the data set not including any outliers, in this case the shortest commute.
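The values that a box plot displays can be sketched in a few lines. This example follows the "median of each half" description of the quartiles. Two choices in it are assumptions rather than anything stated in this resource: the middle value is left out of both halves when the number of values is odd, and an outlier is taken to be any value more than 1.5 times the interquartile range beyond a quartile, which is a common convention. Analysis tools may use other rules and give slightly different quartiles.

```python
from statistics import median


def box_plot_summary(values):
    """Return the values a box plot displays (needs at least two values)."""
    data = sorted(values)
    half = len(data) // 2
    lower_q = median(data[:half])   # median of the lower half
    upper_q = median(data[-half:])  # median of the upper half
    iqr = upper_q - lower_q

    # Assumed outlier rule: more than 1.5 x IQR beyond a quartile.
    low_fence = lower_q - 1.5 * iqr
    high_fence = upper_q + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]

    return {
        "lowest": inside[0],
        "lower_quartile": lower_q,
        "median": median(data),
        "upper_quartile": upper_q,
        "highest": inside[-1],
        "interquartile_range": iqr,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical commute distances in miles, with one unusually long commute.
distances = [2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(distances)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With this data the 30-mile commute is reported as an outlier, so the top line of the plot would sit at 9 miles rather than 30.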

Labelling your charts

To make sure your reader interprets your graph correctly, you must label it well!

You should include:

• a sensible title that summarises the variables being described

• a label on every axis with the name of the variable and the units (if appropriate)

• a key (or legend), if the chart has groups of lines or boxes that have different meanings

Remember: a chart without labels is just a picture!

Frequency tables

Creating a chart will give a visual summary of your data, but you might wish to use descriptive statistics, or numerical summaries, for further analysis. Examples of descriptive statistics include the median, mean, mode, range and standard deviation, which we covered earlier. A numerical summary can also take the form of a frequency table.

A frequency table presents a breakdown of the data between different categories. It is basically a summary of the numbers behind a bar chart or pie chart; it shows the data in multiple columns as frequency counts, percentages or both. A simple one-way frequency table just covers one variable.
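A one-way frequency table can be built with Python's standard `collections.Counter`. This is a minimal sketch with hypothetical survey answers; the function name `frequency_table` is an assumption made for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percentage) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [
        (category, count, 100 * count / total)
        for category, count in counts.most_common()
    ]


# Hypothetical nominal data: does the respondent hold a valid driving licence?
licence = ["yes", "yes", "no", "yes", "no", "yes", "yes", "yes"]

table = frequency_table(licence)
print(f"{'Category':<10}{'Count':>6}{'Percent':>9}")
for category, count, percent in table:
    print(f"{category:<10}{count:>6}{percent:>8.1f}%")
```

The counts are the numbers a bar chart of the same variable would plot, and the percentages are the proportions a pie chart would show.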

Using frequency tables

You can use an analysis tool to calculate descriptive statistics, such as the median. We learned earlier when it is appropriate to do so; it depends on the level of measurement of your data.

You can also use an analysis tool to create a frequency table. Again, it is not always appropriate to do so.

WHEN CAN I USE A FREQUENCY TABLE?

• If your data is nominal, you can only use the mode as a descriptive statistic, so create a frequency table to look at the distribution between categories.

• If your data is ordinal, you can use the mode, median and range as descriptive statistics, but you can also create a frequency table to look at the distribution between categories.

WHEN SHOULDN’T I USE A FREQUENCY TABLE?

• If your data is scale, you have the option of the mean, mode, median, range and standard deviation as numerical summaries, which can be really useful for statistical tests. If you were to create a frequency table, the number of rows might be huge (for example, one for every year if you are collecting ages), so it is usually not useful.

Understanding your data

When you have a data set it is important that you use a logical structure so the data can be easily interpreted and accurately represented.

This section will explore best practices for using and organising your data.

Structure

Data that is easy to understand is easier to analyse. There are several things you must consider in order to produce consistent and reliable data.

Spelling

Check your spellings and typing. It will help you to avoid incorrectly categorised data or creating duplicate categories. Should “small” and “smll” really be two different categories?

Duplication

Check for duplication. Have the same data values been added in different ways by the respondents? For example, if your data is from a survey, have you made sure that “car” and “automobile” are not two different categories?
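One way to tidy misspelt and duplicated categories is to pass every response through a small lookup of known corrections. The sketch below is only illustrative: the `CORRECTIONS` mapping and the `clean_category` function are hypothetical, and in practice the mapping would be built by inspecting your own data.

```python
from collections import Counter

# Hypothetical corrections: misspellings and synonyms mapped to one label.
CORRECTIONS = {
    "smll": "small",
    "automobile": "car",
}


def clean_category(raw: str) -> str:
    """Trim spaces, lower-case the text and apply any known correction."""
    value = raw.strip().lower()
    return CORRECTIONS.get(value, value)


# Hypothetical survey responses as typed by respondents.
responses = [" Car", "automobile", "car", "Bus", "bus "]

before = Counter(responses)
after = Counter(clean_category(r) for r in responses)

print(f"{len(before)} categories before cleaning, {len(after)} after")
print(dict(after))
```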

Notes and labels

Use clear notes and labels. Make sure the logic behind your notes is clear to an outsider; people need to be able to understand your data! Will you remember what “exp1” and “exp2” meant when you look at your work in a few months? What about other people who you work with?

Blanks

Carefully explain the blanks. You should distinguish between readings that are missing because a respondent refused to answer and readings that are missing because that question did not apply to that respondent. For example, use “-1” and “-2” respectively.
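Codes such as "-1" and "-2" must be set aside before any statistics are calculated, otherwise they are treated as real readings. The sketch below uses hypothetical commute times to show the effect; the constant names are assumptions made for illustration.

```python
from statistics import mean

# Codes for missing readings, following the example in this section.
REFUSED = -1         # the respondent refused to answer
NOT_APPLICABLE = -2  # the question did not apply to the respondent
MISSING_CODES = (REFUSED, NOT_APPLICABLE)

# Hypothetical commute times in minutes, including two missing readings.
commute_minutes = [25, 40, -1, 30, -2, 45]

valid = [v for v in commute_minutes if v not in MISSING_CODES]
refused_count = commute_minutes.count(REFUSED)
not_applicable_count = commute_minutes.count(NOT_APPLICABLE)

wrong_mean = mean(commute_minutes)  # wrongly includes the codes
correct_mean = mean(valid)          # uses real readings only

print(f"Mean with codes left in: {wrong_mean:.1f} minutes")
print(f"Mean of valid readings: {correct_mean} minutes")
print(f"Refused: {refused_count}, not applicable: {not_applicable_count}")
```

Keeping the two codes separate also lets you report how many respondents refused to answer and how many were not asked.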

Data types

When data is recorded formally in a statistical analysis tool, database or programming tool, you must choose the data type or format for each variable. This affects both how it is stored and how it is displayed.

Typically, you may choose one of the following data types for each variable.

123 - Numeric

This is a common number format that may or may not include decimal places.

£123 - Currency

Monetary formats.

6.4E-5 - Scientific notation

This is for very large or very small numbers, where aEb means “a multiplied by 10 to the power of b”.

01/01/2014 - Date

There are usually various date and time formats to choose from; standards for this vary in different parts of the world.

abacus - String

Text format, used for qualitative data that is not categorical and cannot be represented any other way.
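The scientific notation and date formats can be explored in Python as a rough illustration. The date "01/02/2014" is a hypothetical value chosen because it reads differently under day-first and month-first conventions.

```python
from datetime import date, datetime

# Scientific notation: 6.4E-5 means 6.4 multiplied by 10 to the power of -5.
small = float("6.4E-5")
print(f"{small:.7f}")  # the same number written out in full
print(f"{small:.1E}")  # and shown in scientific notation again

# The same text can mean two different dates, depending on the convention.
text = "01/02/2014"
day_first = datetime.strptime(text, "%d/%m/%Y").date()    # 1 February 2014
month_first = datetime.strptime(text, "%m/%d/%Y").date()  # 2 January 2014

print(day_first.isoformat(), month_first.isoformat())
```

Recording which date convention a data set uses, or storing dates in an unambiguous form such as year-month-day, avoids this kind of misreading.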

Summary

Statistics allow you to organise, analyse and summarise data. Statistics can make your data more accessible, enabling more people to understand and engage with it.

When using data and statistics, remember to be organised and methodical, and take time to ensure you enter and label your data correctly. Data that isn’t labelled, or isn’t labelled clearly, has little value.

Glossary

Levels of measurement

Type of data we are collecting or analysing. Nominal, ordinal and scale are examples of levels of measurement.

Value

One recording or single piece of data.

Variable

A list of values that all relate to the same question or item being measured. For example, variables for a survey could be age, date of birth or gender.

Data set

A structured collection of values, responses or readings which may be organised by variables. If you conducted a survey and collated the results, this would be a data set.

Analytical software

A piece of software that can provide commands to modify, summarise and analyse data sets. SPSS is an example of analytical software you could use to analyse your data.

Binning

Configuring a histogram to group values into intervals (bins), with one bar per bin. Having too few bins can hide the shape of your data; having too many can make your graphs look overcrowded and cluttered.

Descriptive statistics

Numerical summaries of your data, such as frequency counts and averages.

Mode

The value that occurs most often in the set.

Median

The value in the middle of the set.

Mean

The arithmetic average, which considers the whole set.

Range

The gap between the highest and lowest values.

Standard deviation

A calculated number to describe the scatter of the data about the mean.

Qualitative data

Data that describes or characterises but does not measure a thing. For example, in a survey, if you entered your name this is qualitative as it is describing you and has no numeric value.

Quantitative data

Data that measures or defines a thing and can be quantified and verified. Your income and age are examples of quantitative data.