Advanced statistics for librarians


How to use and evaluate statistical information in library research

John McDonald, Acquisitions Librarian, Caltech

Jason Price, Science & Electronic Resources Librarian, Claremont Colleges

Advanced Statistics

Part I: Research Design

Part II: Statistical Concepts

Part III: Evaluating Library Statistics

Research Design

Validity: How accurately does an indicator measure the concept being studied? Is the technique appropriate for that concept?

Reliability: How consistent is the measurement? Does it yield the same results over repeated attempts and by different researchers? How certain are the results?

Generalizability: How well (or how likely) can the findings be applied to other situations?

Research Design Steps

Research Question

Hypotheses

Data definitions

Data collection

Data analysis

Conclusions

Research Question

What is the study designed to answer?

Why is the study important?

The more specific, the better!

Example: Should the library increase hours during finals week?

Hypothesis

A statement about the expected results.

What you will test after collecting data.

Null hypothesis: there is no difference between Group 1 and Group 2, or between before and after. Notated H0: μ1 = μ2

Alternate hypothesis: there is a difference, and what that difference will be. Notated Ha: μ1 ≠ μ2

Can also be directional if theory or prior research indicates a direction: Ha: μ1 > μ2

Data collection

Observation

Interviews

Focus Groups

Surveys

Transaction Logs

Others?

Data Collection: Sampling

Necessary when it is impossible to study an entire population due to logistical, geographical, monetary, or time constraints.

A sample must be a good representation of the rest of the population.

The larger your sample, the more sure you can be that the answers truly reflect the population.

Accuracy also increases when responses are lopsided. E.g., an estimate is more precise when 99% of respondents choose one presidential candidate than when the split is close to 50/50.

The larger your population, the larger your sample needs to be, except when the population is very large (e.g., the U.S.) or very small (e.g., your household).

Sampling Designs

Simple: assumes homogeneity

Stratified: assumes heterogeneity

Calculating Sample Sizes

1) When you have a very large population size:

$SS = \frac{Z^2 \, p \, (1 - p)}{c^2}$

2) When you have a finite population size:

$ss = \frac{SS}{1 + \frac{SS - 1}{\mathit{pop}}}$

Z = Z value (e.g., 1.96 for 95% confidence level)

p = percentage picking a choice, expressed as decimal (e.g., .5 for 50%)

c = confidence interval, expressed as decimal (e.g., .04 = ±4%)

pop = population size

Sample size spreadsheet
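The two sample size formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the presenters' spreadsheet: the function names and the population of 2,000 are assumptions made for the example, while Z = 1.96, p = .5, and c = .04 are the example values given with the formulas.

```python
import math


def sample_size_large_pop(z: float, p: float, c: float) -> float:
    """Formula 1: SS = Z^2 * p * (1 - p) / c^2, for a very large population."""
    return (z ** 2) * p * (1 - p) / (c ** 2)


def sample_size_finite_pop(ss_large: float, pop: int) -> float:
    """Formula 2: ss = SS / (1 + (SS - 1) / pop), for a finite population."""
    return ss_large / (1 + (ss_large - 1) / pop)


# 95% confidence level (Z = 1.96), p = 0.5, confidence interval of +/-4%
SS = sample_size_large_pop(1.96, 0.5, 0.04)

# Hypothetical population of 2,000 patrons (an assumption, not from the slides)
ss = sample_size_finite_pop(SS, 2000)

# Round up, since a fraction of a respondent cannot be sampled
print(math.ceil(SS))  # 601
print(math.ceil(ss))  # 462
```

Note how the finite-population correction lowers the required sample from about 600 to about 460 respondents once the population is only a few times larger than the uncorrected sample.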

M&M Sampling

• Research Question: What is the color distribution of M&Ms?

• Sample: What is the color distribution of a simple random sample of M&Ms?

• Test: Does my sample yield different results than what is reported by the company?

• Method: Packages of M&Ms distributed to each participant. Each package is a random sample from the company.

M&M Data Collection & Testing

Let’s look at the colors in individual samples of M&Ms.

        Red   Blue   Green   Yellow   Orange   Brown
John    3     2      1       7        3        4
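As a first step toward the test, one participant's counts can be turned into a color distribution. The sketch below uses only John's row from the table above; the variable names are illustrative, and the company's reported distribution is not included because it does not appear in the slides.

```python
colors = ["Red", "Blue", "Green", "Yellow", "Orange", "Brown"]
john_counts = [3, 2, 1, 7, 3, 4]  # John's package, from the table above

total = sum(john_counts)

# Share of each color in this one sample
proportions = {color: n / total for color, n in zip(colors, john_counts)}

for color, share in proportions.items():
    print(f"{color}: {share:.0%}")
```

With only 20 candies in one package, these shares are rough estimates; pooling all participants' packages gives a larger sample and a steadier distribution.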

Data Definitions

Data Scales: Nominal, Ordinal, Interval, Ratio

Frequency Distributions: Flat, Normal, Skewed

Variable Types: Dependent, Independent, Extraneous

Data Scales

Nominal: scaled without order, indicating that classifications are different. Example: Public & private institutions.

Ordinal: scaled with order, but without distance between values. Example: Carnegie classifications.

Interval: scaled with order and numerically equal distances on the scale. Example: Grade level (freshman, sophomore, etc.), if the steps are treated as equally spaced; strictly, grade level is ordinal.

Ratio: scaled with equal intervals and a zero starting point. Example: Fulltext downloads.

Nominal and ordinal variables are discrete, while interval and ratio variables are usually treated as continuous.

Name that data type!

Salary

Author of a book

Hours spent in the library

Patron status

Publication year of a journal

Ranked journal lists

Test results on instruction classes

Number of articles read

FTE

Data Distributions

Described by their kurtosis (peakedness and tail weight) and skew (asymmetry)

Normal: bell shaped curve with gradual slopes

Non-normal (skewed): extreme values with steep slopes

Fulltime Students at ARL Schools

[Figure: histogram. x-axis: Total Fulltime Students (0 to 60,000); y-axis: 0 to 25. N = 114, Mean = 22,171.55 (about 22K), Std. Dev. = 10,075.743 (about 10K).]

Total Salaries & Wages at ARL Libraries

[Figure: histogram. x-axis: Total Salaries & Wages, in million dollars (10 to 50); y-axis: 0 to 40. N = 114, Mean = 10,153,401.64 (about 10M), Std. Dev. = 6,534,491.237 (about 6.5M).]

Variables

Dependent: the variable being measured, studied, and predicted.

Independent: variables that can be manipulated or are predictors of the dependent variable.

Extraneous: variables other than the independent variables that can influence the dependent variable.

Data analysis

Descriptive statistics: Mean, Median, Mode, Standard Deviation

Correlational statistics: Correlation

Inferential statistics: T-test, Regression, Chi-square, ANOVA
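The descriptive statistics listed above are all available in Python's standard `statistics` module. The sketch below is illustrative only: the daily gate counts are made-up numbers, not data from the presentation.

```python
import statistics

# Hypothetical daily gate counts for one week (illustrative, not from the slides)
gate_counts = [410, 385, 420, 385, 530, 610, 385]

mean = statistics.mean(gate_counts)      # arithmetic average
median = statistics.median(gate_counts)  # middle value when sorted
mode = statistics.mode(gate_counts)      # most frequent value
sd = statistics.stdev(gate_counts)       # sample standard deviation (n - 1)

print(f"mean={mean:.1f} median={median} mode={mode} sd={sd:.1f}")
```

In this made-up week the mean sits above the median because two busy days pull it upward, which is the kind of skew the distribution slides describe.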

Correlational Statistics

Correlation establishes that two measures have a relationship.

Indicates direction & strength, but not causation!

Allows researcher to consider other statistical tests with confidence.

Requirements: random sample, interval or ratio data, normal distribution, linear relationship

Correlational Statistics

Direction

Positive: As one value increases, the other does as well.

Example: Age and height. Library: Enrollment & materials budget.

Negative: As one value increases, the other decreases.

Example: Car speed & time to destination. Library: Items purchased & shelf space.

Strength

Value between 1 (positive) and -1 (negative). The closer to those values, the stronger the relationship.

Correlation
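To make direction and strength concrete, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The function name and both data sets are assumptions for illustration, loosely modeled on the enrollment/budget and car speed/travel time examples.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: enrollment (thousands) and materials budget ($ millions)
enrollment = [5, 10, 15, 20, 25]
budget = [1.1, 2.3, 2.9, 4.2, 5.0]
r_pos = pearson_r(enrollment, budget)

# Hypothetical data: car speed (mph) and hours needed for a 120-mile trip
speed = [30, 40, 50, 60]
hours = [4.0, 3.0, 2.4, 2.0]
r_neg = pearson_r(speed, hours)

print(f"enrollment vs budget: r = {r_pos:.2f}")
print(f"speed vs travel time: r = {r_neg:.2f}")
```

The first value comes out close to 1 and the second close to -1. The speed example is not perfectly linear, which is why its coefficient falls slightly short of -1 even though the relationship is exact.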

Inferential Statistics

Parametric: assume that the dependent variable has a known underlying mathematical distribution (normal, binomial, Poisson, etc.) which serves as the basis for sample-to-population estimates. Parametric tests are robust and have great power efficiency.

Non-parametric: do not assume a normal distribution (distribution free) & require that the data meet fewer assumptions. Allow for the analysis of a mixture of data types.

T-Test

Determine if there is a difference (in a characteristic) between two populations based on data from samples of those populations.

Requirements: random sample, interval or ratio data, normal distribution, equal standard deviations

T-Test
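A two-sample t statistic can be sketched with the standard library alone. Because the requirements call for equal standard deviations, this version pools the two sample variances; the function name and the test scores are hypothetical, not data from any study cited here.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t, df


# Hypothetical post-test scores for two instruction formats
online = [78, 85, 82, 90, 74, 88]
in_class = [72, 79, 81, 85, 70, 77]

t_stat, df = pooled_t(online, in_class)
print(f"t = {t_stat:.2f}, df = {df}")
```

The resulting t is then compared with the critical value for the chosen alpha and degrees of freedom (2.228 for a two-tailed test at alpha = 0.05 with df = 10). A t below that value means the difference between the two sample means is not statistically significant.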

Regression

Predicts values of a dependent variable based on values of independent (predictor) variables

Requirements: interval or ratio data, normal distribution, correlated variables, linear relationship

Regression
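With a single predictor, regression reduces to fitting a line by ordinary least squares. The sketch below is one minimal way to do that; the function name and the FTE/download figures are invented for illustration.

```python
def ols_fit(x, y):
    """Simple linear regression by least squares; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: FTE (thousands) and fulltext downloads (thousands)
fte = [2, 4, 6, 8, 10]
downloads = [24, 47, 63, 88, 103]

slope, intercept = ols_fit(fte, downloads)

# Predict downloads for a hypothetical institution with 12,000 FTE
predicted = slope * 12 + intercept
print(f"downloads = {slope:.2f} * FTE + {intercept:.2f}; at 12: {predicted:.1f}")
```

Here FTE is the independent (predictor) variable and downloads the dependent variable. Predicting at 12 extrapolates beyond the observed range, so such a figure deserves extra caution.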

ANOVA

Determine if there are differences between three or more sample means.

Tests the significance of the difference; follow-up (post hoc) comparisons are needed to show which groups differ and in what direction.

Requirements: normal distribution (in each cell), interval or ratio data, homogeneity of variance

ANOVA
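A one-way ANOVA compares the variation between group means with the variation within groups. The following is a hand-rolled sketch of the F statistic; the function name and the three groups of weekly library hours are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical weekly hours in the library for three patron groups
undergrads = [4, 5, 6]
grads = [7, 8, 9]
faculty = [1, 2, 3]

f_stat, df_b, df_w = one_way_anova_f([undergrads, grads, faculty])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 27.00
```

A large F means the group means differ by more than the within-group scatter would suggest; the value is judged against the F distribution for the two degrees of freedom.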

Chi Square Test

Difference between expected and observed frequencies for nominal or ordinal data

Requirements: any type of data, large sample size (>50), similar distributions

Chi Square Test

Pepsi Challenge

Observed: Pepsi 85, Coke 57, RC 78. Expected (equal) = 73.33

         O     E       O-E      (O-E)²   (O-E)²/E
Pepsi    85    73.33   11.67    136.19   1.86
Coke     57    73.33   -16.33   266.67   3.64
RC       78    73.33   4.67     21.81    0.30
Totals   220   219.99                    χ² = 5.8

Degrees of freedom = rows - 1 = 3 - 1 = 2

Critical value of χ² = 5.99 at alpha = 0.05

Observed value of χ² = 5.8

Decision: Fail to reject H0
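The Pepsi Challenge arithmetic above can be reproduced in a few lines. This sketch uses the observed counts and the critical value exactly as given in the example; only the variable names are made up.

```python
# Observed taste-test preferences from the Pepsi Challenge example
observed = {"Pepsi": 85, "Coke": 57, "RC": 78}

total = sum(observed.values())
expected = total / len(observed)  # equal preference: 220 / 3

# Sum of (O - E)^2 / E over all categories
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())

df = len(observed) - 1
critical = 5.99  # critical value at alpha = 0.05, df = 2 (from the example)

reject_null = chi_sq > critical
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_null}")
```

The statistic comes out just under the critical value, so the data do not show a significant departure from equal preference at the 0.05 level.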

Inferential Statistics

Goal                                     Parametric            Non-parametric
Compare sample to a hypothetical value   T-test                Wilcoxon test; Chi-square
Quantify association between variables   Pearson correlation   Spearman correlation; Kendall's tau
Predict value from measured variables    OLS regression        Poisson regression; Negative binomial regression
Compare 2 unpaired groups                Unpaired t-test       Mann-Whitney test; Fisher's test
Compare 2 paired groups                  Paired t-test         Wilcoxon signed-rank test
Compare 3+ unmatched groups              ANOVA                 Kruskal-Wallis test; Chi-square test

Review: Research Design

Research Question: What will the study answer?

Hypotheses: What do you think the results will be?

Data definitions: What scales are the variables, what is the distribution, and what are the dependent, independent & extraneous variables?

Data collection: What is the best method for collecting the variables of interest?

Data analysis: What are the proper statistical tests to use on the data?

Conclusions: What do the data show us or indicate?

Case Studies

Citation Analysis: Antelman, K. (2004). “Do Open-Access Articles Have a Greater Research Impact?” College & Research Libraries 65(5): 372-382.

Usage Analysis: Blecic, DD. (1999). “Measurements of journal use: an analysis of the correlations between three methods.” Bull Med Libr Assoc 87(1): 20-25.

Service Analysis: Nichols, J; Shaffer, B; Shockey, K. (2003). “Changing the Face of Instruction: Is Online or In-class More Effective?” College & Research Libraries 64(5): 378-389.

“Changing the Face of Instruction…”

Research Question: Is an online tutorial as effective in teaching library instruction as a classroom setting?

Hypotheses

H1. Students will have higher scores in information literacy tests after library instruction.

H2. Students will have the same or higher scores in info-lit tests after taking online tutorials as students taking traditional instruction.

H3. Students will report as much or more satisfaction with online instruction as students taking traditional instruction.

Variables: Test scores & survey results

Data Collection: Pretest/Posttest & Survey

Statistical Tests: Descriptive statistics incl. mean, standard deviation, standard error; t-tests (1 & 2 tailed)

Conclusions

Accept H1: Instruction improves literacy.

Accept H2 alternative hypothesis: Online has no significant difference from traditional.

Accept H3 alternative hypothesis: Student satisfaction is equal with both methods.

“Do Open-Access Articles…”

Research Question: Do freely available articles have a greater research impact?

Measures

Research impact: citation rates

Open Access: freely available

Hypotheses

H1. Scholarly articles have a greater research impact if the articles are freely available online than if they are not.

H0 (null hypothesis): There is no difference between the mean citation rates. H0: d1 = d0

Variables: Mean citation rates

Data Collection: At least 50 articles from 10 leading journals in 4 disciplines.

Statistical Tests: Descriptive statistics incl. mean, standard deviation, standard error; Wilcoxon signed-rank test

Conclusions

Reject H0: Open Access articles are cited more than those that are not OA.

Critical Questions

Validity? Reliability of measures? Generalizability? Alternate hypotheses?

Discussion

My favorite statistic…

Baseball is 90% mental – the other half is physical.
