Univariate Analysis


Simple Tools for Description

POLI 399/691 - Fall 2008, Topic 6

Description of Variables

Univariate analysis refers to the analysis of one variable.
Several statistical measures can be employed to describe data.
  Allows for comparison across variables measured in different units.
  Provides parsimony: one or two statistics can help us understand a large number of cases.


Basic descriptive tools

Proportion
  Share of cases relative to the whole population; the range is from 0 to 1.
  E.g. if there are 50 women in a sample of 125, then the proportion of women is 50/125 = 0.4.
Percentage is the proportion multiplied by 100.
  E.g. if the proportion is .40, then the percentage is .40 x 100 = 40%.
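The proportion and percentage calculations can be sketched in Python. This is only an illustration using the 50-of-125 example from the slide; the function names are made up for this sketch and are not part of the course materials.

```python
def proportion(count, total):
    """Share of cases relative to the whole; always between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def percentage(count, total):
    """The proportion expressed out of 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * count / total


women, sample = 50, 125
print(proportion(women, sample))   # 0.4
print(percentage(women, sample))   # 40.0
```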


Percentage change allows us to calculate the relative change in a variable over some period of time. Percentage change is:

$$\text{Percentage change} = \frac{\text{Time 2} - \text{Time 1}}{\text{Time 1}} \times 100$$

  E.g. in 1993 women made up 48% of the population and in 2003 this percentage had risen to 51%. What is the percentage change from 1993 to 2003?
  ((51-48)/48) x 100 = (3/48) x 100 = 6.25% (it is not 3%)

Percentage point difference is the absolute change between the percentage at time 1 and the percentage at time 2.
  Using the same example, the percentage point difference in the share of women in the population between 1993 and 2003 is 3 percentage points (X2-X1) (it is not 3%).
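The difference between the two measures is easy to check numerically. The sketch below uses the 1993 and 2003 figures from the example; the function names are hypothetical.

```python
def percentage_change(time1, time2):
    """Relative change from time1 to time2, as a percentage of time1."""
    if time1 == 0:
        raise ValueError("percentage change is undefined when time1 is 0")
    return (time2 - time1) / time1 * 100


def percentage_point_difference(pct1, pct2):
    """Absolute change between two percentages, in percentage points."""
    return pct2 - pct1


share_1993, share_2003 = 48, 51
print(percentage_change(share_1993, share_2003))            # 6.25
print(percentage_point_difference(share_1993, share_2003))  # 3
```

The same 3-point gap would be a larger percentage change from a smaller starting share, which is why the two figures should not be reported interchangeably.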


Frequency Table

The frequency table (or frequency distribution) is commonly used to provide a “snapshot” of a variable

Made up of 4 columns:
  Values (categories) of the variable
  The number of cases
  The percentage of cases
  The cumulative percentage of cases

Consider collapsing categories if the variable has a large number of values/categories


Table 1: Frequency Table of Grouped Data – Ages of Respondents

Age Group     Frequency   Percentage   Cumulative Percentage
18-24                36         15.0                    15.0
25-34                44         18.3                    33.3
35-44                43         17.9                    51.2
45-54                46         19.2                    70.4
55-64                34         14.2                    84.6
65 and over          37         15.4                   100.0
Total               240       100.0%                  100.0%

Source: Hypothetical Data, 2005.
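The four columns of a frequency table can be derived from the raw counts alone. The following is a minimal sketch using the frequencies in Table 1; the function name and the row layout are assumptions made for illustration.

```python
# Frequencies copied from Table 1 (hypothetical data).
frequencies = {
    "18-24": 36,
    "25-34": 44,
    "35-44": 43,
    "45-54": 46,
    "55-64": 34,
    "65 and over": 37,
}


def frequency_table(freqs):
    """Return rows of (value, frequency, percentage, cumulative percentage)."""
    total = sum(freqs.values())
    rows = []
    running = 0
    for value, count in freqs.items():
        running += count  # accumulate counts, not rounded percentages
        rows.append((value, count, 100 * count / total, 100 * running / total))
    return rows


table = frequency_table(frequencies)
for value, count, pct, cum_pct in table:
    print(f"{value:<12}{count:>5}{pct:>8.1f}{cum_pct:>8.1f}")
```

Accumulating the counts and converting to a percentage at each step avoids the small drift that can appear when already-rounded percentages are added together.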


Bar charts, pie charts and line graphs

Bar charts or pie charts are good for showing the variation in the percentage of cases for each value of a variable.
  Pie chart: compare parts to the whole.
  Bar graph: compare categories/values.

Line chart is good for longitudinal data.
  Reveals trends over time.


Figure 1: Federal Expenditures by Sector

[Bar chart. Vertical axis: Percentage. Horizontal axis: Expenditure Type. Bars: Social 38, Public Debt 25, Fiscal Arrangements 12, Defence 7, Gov't Operations 4, Other 14.]

Source: Hypothetical Data, 2006


Figure 2: Federal Expenditures by Sector

[Pie chart. Slices: Social 38%, Public Debt 25%, Fiscal Arrangements 12%, Defence 7%, Gov't Operations 4%, Other 14%.]

Source: Hypothetical Data, 2006


Figure 3: Share of Women among Party Leaders Selected by Year, 1980-2005

[Line chart. Vertical axis: Percentage (0 to 50). Series: Percentage per year; 3-period moving average of percentage per year.]

Source: O’Neill and Stewart, “Gender and Political Party Leadership in Canada,” Party Politics, forthcoming.


Table 8: Political Participation by Volunteer Type

                                        Religious Volunteers  All Other Volunteers  Non-Volunteers
Voted in last federal election                          83.7                  80.8            71.6
Voted in last provincial election                       82.6                  79.2            70.6
Voted in last municipal election                        72.8                  67.4            58.0
Follow news or current affairs daily                    70.2                  66.8            65.7
N (over 18 only): (509) 537 (1603) 1745 (5346)

Note: Entries are percentage of respondents who reported engaging in said activity. All differences across the three groups are statistically significant (p<.01). Differences between religious and other volunteers in reported municipal voting statistically significant (p<.05).

Source: Brenda O’Neill, “Canadian Women’s Religious Volunteerism: Compassion, Connections and Comparisons” in B. O’Neill and E. Gidengil, Gender and Social Capital, New York: Routledge, 2006.


Checklist for Charts and Tables

Have you chosen the proper type of chart?
Have you provided a clear, descriptive title? (Note the difference between “Table” and “Figure”)
Is the data source noted in a footnote?
Are statistical tests reported in a footnote?
For bivariate tables, is the dependent variable on the vertical axis? The independent on the horizontal?
Are the axes properly labelled?
Will colour choices matter if printed in black and white?
Have you provided values in bar/pie charts?
Does the length of the axes distort the result?
Have you referred to and explained the table/chart in the text?


Measures of Central Tendency

Measures of central tendency allow us to speak of some “standard” case for all the cases in the sample or population.
  What is the most common unit?
  Is there some pattern in the data?

Three different measures: mean, median and mode
  Nominal data? Use mode
  Ordinal data? Use mode and/or median
  Interval data? Use mode, median and/or mean

The mean provides the most information; the mode, the least.
  Always use the statistic that provides the most information; the goal is parsimony.


Mode

For nominal data, the mode is the measure of the “standard” or “most common” case.
The mode is simply the category of the variable that occurs most often (i.e. has the most cases).
The mode is the “best guess” for nominal data.
The utility of this statistic is limited:
  It can change dramatically with the addition of a few cases (not very stable).
  It tells us about the most common value but little else.
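To see how unstable the mode can be, the sketch below counts categories in a small nominal variable. The party-preference data are invented for illustration, and the helper name `modes` is an assumption.

```python
from collections import Counter


def modes(values):
    """Return every category tied for the highest count."""
    counts = Counter(values)
    top = max(counts.values())
    return [category for category, n in counts.items() if n == top]


# Hypothetical nominal data: party preference of eight respondents.
votes = ["Liberal", "NDP", "Liberal", "Conservative",
         "Conservative", "Liberal", "Green", "NDP"]
print(modes(votes))   # ['Liberal']

# Two additional Conservative cases are enough to change the mode.
print(modes(votes + ["Conservative", "Conservative"]))   # ['Conservative']
```

Returning a list rather than a single value makes ties visible, which is one more reason the mode carries limited information.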


Figure 1: Federal Expenditures by Sector

[Bar chart repeated from Figure 1, with an arrow marking the tallest bar, Social (38).]

← Mode is Social Expenditures

Source: Hypothetical Data, 2006


Median

Use with ordinal data.
Indicates the middle case in an ordered set of cases: the midpoint.
To determine the median, order the data from lowest to highest; the median is the value of the middle case.
  Even number of cases? Take the average of the two middle values (add them together and divide by 2).
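The procedure for finding the median can be written out directly. This is a minimal sketch with invented scores on a 1-5 scale; Python's standard library offers `statistics.median`, which gives the same result for numeric data.

```python
def median(values):
    """Middle value of the ordered cases; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of no cases is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical scores on a 1-5 agreement scale.
print(median([4, 1, 3, 5, 2]))      # 3
print(median([4, 1, 3, 5, 2, 4]))   # 3.5
```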


Mean

The mean describes the centre of gravity of interval data.
  Commonly called the average.
Easily allows one to locate a case relative to all others.
  Where is a case located in relation to all the others? Above average? Below average?
To calculate:

$$\bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

where n = number of cases.
Reliable but sensitive to outliers (cases that are much larger or much smaller than the rest).
  The median provides a better sense of the typical case when there are outliers.


Example: Income data

Income for 10 cases

$24,000
$25,000
$28,000
$30,000
$35,000
  ← Median ($36,500), midway between the 5th and 6th cases
$38,000
$56,000
$75,000
$86,000
  ← Mean ($1,039,700)
$10,000,000

For these data, the mean is $1,039,700 and the median is $36,500.
We call a distribution with outliers a skewed distribution.
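The mean and median for the income example can be reproduced with Python's standard `statistics` module. The sketch also drops the one very large income to show how much each statistic moves; that comparison is an added illustration, not part of the original slide.

```python
import statistics

# The ten incomes listed in the example.
incomes = [24_000, 25_000, 28_000, 30_000, 35_000,
           38_000, 56_000, 75_000, 86_000, 10_000_000]

print(statistics.mean(incomes))     # 1039700
print(statistics.median(incomes))   # 36500.0

# Without the outlier, the mean falls to roughly 44,111 while the
# median only shifts from 36,500 to 35,000.
without_outlier = incomes[:-1]
print(statistics.mean(without_outlier))
print(statistics.median(without_outlier))   # 35000
```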


Measures of Dispersion

Once you know the standard case, you should also know how standard the case is: that is, how well does this one case represent all the cases?
For nominal data, there is no measure of dispersion; one could simply indicate how many categories exist.
For ordinal data, the range provides some information about the spread of data.
  The range is simply the highest value minus the lowest value.
  When we have outliers, the range gives a distorted picture of the data.
  E.g. for our income data, the range is $10,000,000 - $24,000 = $9,976,000.
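A short sketch of the range calculation for the income data follows. The function name is hypothetical, and the second call (leaving out the largest income) is an added comparison to show the distortion.

```python
def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


incomes = [24_000, 25_000, 28_000, 30_000, 35_000,
           38_000, 56_000, 75_000, 86_000, 10_000_000]

print(value_range(incomes))        # 9976000
print(value_range(incomes[:-1]))   # 62000
```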


For interval data, we use the standard deviation.
  A measure of the average deviation of a case from the mean value.
  A deviation is the distance and direction of any raw score from the mean.
  The larger the deviation, the further the score from the mean.
  The deviation can be either positive or negative (larger or smaller than the mean value).

The mean is that value where the sum of negative deviations equals, in absolute value, the sum of positive deviations.

We want to calculate the average size of these deviations, but we need to ‘fix’ the problem of the deviations summing to 0.

To fix the problem, we square each deviation before we sum them, divide the total by the number of cases, and then take the square root of the result.
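The reason squaring is needed can be checked with a few numbers. The sketch below uses the five grades from Table 8.10 later in this topic; the variable names are illustrative.

```python
grades = [66, 72, 88, 90, 94]
mean = sum(grades) / len(grades)            # 82.0

# Raw deviations cancel out: negatives and positives sum to zero.
deviations = [x - mean for x in grades]
print(deviations)                           # [-16.0, -10.0, 6.0, 8.0, 12.0]
print(sum(deviations))                      # 0.0

# Squaring removes the signs, so the total no longer cancels.
squared = [d ** 2 for d in deviations]
print(sum(squared))                         # 600.0
```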


Formula for standard deviation

$$sd = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$$

Note: N-1 is employed for a sample


To calculate the standard deviation:
  1. Calculate the mean.
  2. Subtract the mean from each value (these are the deviations).
  3. Square each of the deviations.
  4. Sum them (add them together).
  5. Divide this sum by the number of cases (to get the average squared deviation).
  6. Compute the square root of the average squared deviation.


Table 8.10 Computation of Standard Deviation, Beth’s Grades

SUBJECT             GRADE (Xi)   Xi – X̄            (Xi – X̄)²
Sociology                   66   66 – 82 = –16           256
Psychology                  72   72 – 82 = –10           100
Political science           88   88 – 82 = 6              36
Anthropology                90   90 – 82 = 8              64
Philosophy                  94   94 – 82 = 12            144
MEAN                      82.0   TOTAL                   600

$$sd = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N - 1}} = \sqrt{\frac{600}{4}} = 12.25$$

Note: The “N – 1” term is used when sampling procedures have been used. When population values are used the denominator is “N.” SPSS uses N – 1 in calculating the standard deviation in the DESCRIPTIVES procedure.
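The computation for Beth's grades can be reproduced step by step. This is a sketch rather than a reference implementation; the function name and the `sample` flag are assumptions introduced here to switch between the N – 1 and N denominators.

```python
import math


def standard_deviation(values, sample=True):
    """Square root of the average squared deviation from the mean.

    Divides by N - 1 when sample=True and by N otherwise.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    denominator = n - 1 if sample else n
    return math.sqrt(sum(squared_deviations) / denominator)


grades = [66, 72, 88, 90, 94]
print(round(standard_deviation(grades, sample=True), 2))    # 12.25
print(round(standard_deviation(grades, sample=False), 2))   # 10.95
```

In the standard library, `statistics.stdev` and `statistics.pstdev` correspond to the sample and population versions respectively.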


The result is always a positive number, but you can think of the average deviation as occurring either positively or negatively.

The last measure to review is the variance.
  Variance is simply the square of the standard deviation.
Variance and standard deviation are easily calculated by software programs.
  Good to calculate them on your own for small samples to get a “feel” for the statistic.
These are two statistics that will be used again for other calculations.
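The relationship between the two statistics can be confirmed with Python's `statistics` module, here using Beth's five grades and the sample (N – 1) versions of both functions.

```python
import math
import statistics

grades = [66, 72, 88, 90, 94]

sample_sd = statistics.stdev(grades)            # divides by N - 1
sample_variance = statistics.variance(grades)   # divides by N - 1

# The variance equals the standard deviation squared, up to
# floating-point rounding.
print(math.isclose(sample_sd ** 2, sample_variance))   # True
```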


The smaller the standard deviation, the tighter the cases are around the mean.
  The mean is a “better” predictor of scores when the standard deviation is small.
Like the mean, the standard deviation is also sensitive to outliers.
Describing data effectively requires information on both the mean and the standard deviation.
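The sensitivity to outliers can be illustrated with the income data from the earlier example. The comparison with and without the largest income is an added illustration; with the outlier the standard deviation is in the millions, without it in the tens of thousands.

```python
import statistics

incomes = [24_000, 25_000, 28_000, 30_000, 35_000,
           38_000, 56_000, 75_000, 86_000, 10_000_000]

with_outlier = statistics.stdev(incomes)          # all ten cases
without_outlier = statistics.stdev(incomes[:-1])  # largest income dropped

print(round(with_outlier))
print(round(without_outlier))
print(with_outlier > 100 * without_outlier)   # True
```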


Statistics and SPSS

Statistic          Nominal        Ordinal            Interval
Central Tendency   Mode           Mode               Mode
                                  Median             Median
                                                     Mean
Dispersion         --             Range              Range
                                                     Standard Deviation
                                                     Variance
SPSS Commands      Frequencies    Frequencies        Descriptives
(options)          (mode)         (range, median)    (all)

Source: Jackson and Verberg, p.222.


Z Scores (or standardized scores)

A Z score represents the distance from the mean, in standard deviation units, of any value in a distribution

Z scores are comparable across different populations and different units because they are offered in standard units

The Z score formula is as follows:

$$Z = \frac{X_i - \bar{X}}{sd}$$
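As a worked illustration of the formula, the sketch below standardizes Beth's five grades. It assumes the sample (N – 1) standard deviation, in line with the note that SPSS DESCRIPTIVES uses N – 1; the function name is hypothetical.

```python
import statistics


def z_scores(values):
    """Distance of each value from the mean, in standard deviation units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # assumption: sample (N - 1) standard deviation
    return [(x - mean) / sd for x in values]


grades = [66, 72, 88, 90, 94]
for grade, z in zip(grades, z_scores(grades)):
    print(grade, round(z, 2))
# 66 -1.31
# 72 -0.82
# 88 0.49
# 90 0.65
# 94 0.98
```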


A negative z-score means the case falls below the mean; a positive one means it lies above the mean.
  A z-score of 0 means ...?
  The larger the absolute value of the score, the further the case is from the mean.

Useful when combining variables with very different ranges into indexes.
  Transform into Z scores and then create the index.

To obtain Z scores in SPSS:
  Select Analyze → Descriptive Statistics → Descriptives.
  Select one or more variables.
  Check “Save standardized values as variables” to save z scores as new variables. They will be the last variables in the variable view screen.
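Outside SPSS, the same idea of standardizing before building an index can be sketched in Python. The two variables and their values are invented for illustration, and averaging the two z scores is only one possible way to form an index.

```python
import statistics


def standardize(values):
    """Convert raw values to z scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


# Hypothetical data for five respondents, measured on very different scales.
income = [24_000, 30_000, 38_000, 56_000, 86_000]   # dollars
education = [10, 12, 12, 16, 18]                     # years of schooling

# Once standardized, each variable contributes on the same scale.
index = [(z_inc + z_edu) / 2
         for z_inc, z_edu in zip(standardize(income), standardize(education))]
for score in index:
    print(round(score, 2))
```

Without standardizing, income would dominate any simple sum because its values are thousands of times larger than years of schooling.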


Key terms

Proportion
Percentage
Percentage change
Percentage point difference
Bar chart
Pie chart
Frequency table
Cumulative percentage
Mean
Median
Mode
Outlier
Skewed distribution
Measures of variation
Range
Standard deviation
Variance
Standardized (Z) scores
