Preconference workshop, Charleston Conference, November 7, 2006
Advanced Statistics for Librarians
How to use and evaluate statistical information in library research
John McDonald, Acquisitions Librarian, Caltech
Jason Price, Science & Electronic Resources Librarian, Claremont Colleges
Advanced Statistics
Part I: Research Design
Part II: Statistical Concepts
Part III: Evaluating Library Statistics
Research Design
Validity: How accurately does an indicator measure the concept being studied? Is the technique appropriate for measuring that concept?
Reliability: How consistent is the measurement? Does it yield the same results over repeated attempts and by different researchers? How certain are the results?
Generalizability: How well (or how likely) can the findings be applied to other situations?
Research Design Steps
Research Question
Hypotheses
Data definitions
Data collection
Data analysis
Conclusions
Research Question
What is the study designed to answer?
Why is the study important?
The more specific, the better!
Example: Should the library increase hours during finals week?
Hypothesis
A statement about the expected results.
What you will test after collecting data.
Null Hypothesis: there is no difference between Group 1 & Group 2 or Before/After. Notated H0: μ1 = μ2
Alternate Hypothesis: there is a difference, and what that difference will be. Notated Ha: μ1 ≠ μ2
Can also be directional if theory or prior research indicates a direction: Ha: μ1 > μ2
Data collection
Observation
Interviews
Focus Groups
Surveys
Transaction Logs
Others?
Data Collection: Sampling
Necessary when it is impossible to study an entire population due to logistical, geographical, monetary, or time constraints.
A sample must be a good representation of the rest of the population.
The larger your sample, the more sure you can be that the answers truly reflect the population.
Accuracy increases as responses become more lopsided. E.g., an estimate is more precise when 99% choose one presidential candidate than when the split is 50/50.
The larger your population, the larger your sample needs to be, except when the population is very large (e.g., the U.S.) or very small (e.g., your household).
Sampling Designs
Simple: assumes homogeneity
Stratified: assumes heterogeneity
Calculating Sample Sizes
1) When you have a very large population size:
$$SS = \frac{Z^2 \, p \, (1 - p)}{c^2}$$
2) When you have a finite population size:
$$ss = \frac{SS}{1 + \frac{SS - 1}{\text{pop}}}$$
Z = Z value (e.g., 1.96 for a 95% confidence level)
p = percentage picking a choice, expressed as a decimal (e.g., .5 for 50%)
c = confidence interval, expressed as a decimal (e.g., .04 = ±4%)
Sample size spreadsheet
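The two sample size formulas can be sketched in a few lines of Python. This is only an illustration: the function names and the population of 1,000 are assumptions chosen for the example, while Z = 1.96, p = .5, and c = .04 are the example values given on the slide.

```python
import math


def sample_size(z: float, p: float, c: float) -> float:
    """Formula 1: SS = Z^2 * p * (1 - p) / c^2 (very large population)."""
    return (z ** 2) * p * (1 - p) / (c ** 2)


def finite_correction(ss: float, pop: int) -> float:
    """Formula 2: ss = SS / (1 + (SS - 1) / pop) (finite population)."""
    return ss / (1 + (ss - 1) / pop)


# 95% confidence level, 50% picking a choice, +/-4% confidence interval
large_pop = sample_size(z=1.96, p=0.5, c=0.04)

# Hypothetical finite population of 1,000 (an assumed figure)
finite_pop = finite_correction(large_pop, pop=1000)

# Round up, since you cannot survey a fraction of a respondent
print(math.ceil(large_pop), math.ceil(finite_pop))
```

With these inputs the uncorrected sample size is about 600, and the finite population correction lowers it noticeably, which is why the correction matters for small populations.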
M&M Sampling
• Research Question: What is the color distribution of M&Ms?
• Sample: What is the color distribution of a simple random sample of M&Ms?
• Test: Does my sample yield different results than what is reported by the company?
• Method: Packages of M&Ms are distributed to each participant. Each package is a random sample from the company.
Let’s look at the colors in individual samples of M&Ms
M&M Data Collection & Testing
        Red  Blue  Green  Yellow  Orange  Brown
John      3     2      1       7       3      4
Data Definitions
Data Scales: Nominal, Ordinal, Interval, Ratio
Frequency Distributions: Flat, Normal, Skewed
Variable Types: Dependent, Independent, Extraneous
Data Scales
Nominal: scaled without order, indicating that classifications are different. Example: Public & private institutions.
Ordinal: scaled with order, but without distance between values. Example: Carnegie classifications.
Interval: scaled with order and with numerically equal distances on the scale, but no true zero. Example: Temperature in degrees Fahrenheit.
Ratio: scaled with equal intervals and a zero starting point. Example: Fulltext downloads.
Nominal and ordinal variables are discrete, while interval and ratio variables are usually treated as continuous.
Name that data type!
Salary
Author of a book
Hours spent in the library
Patron status
Publication year of a journal
Ranked journal lists
Test results on instruction classes
Number of articles read
FTE
Data Distributions
Described by their kurtosis (peakedness and tail weight) and skew (asymmetry)
Normal: bell-shaped curve with gradual slopes
Non-normal (skewed): extreme values with steep slopes
Fulltime Students at ARL Schools
[Histogram of total fulltime students, x-axis from 0 to 60,000. N = 114, Mean = 22,171.55, Std. Dev. = 10,075.743]
Total Salaries & Wages at ARL Libraries
[Histogram of total salaries & wages in million dollars, x-axis from 10 to 50. N = 114, Mean = 10,153,401.64, Std. Dev. = 6,534,491.237]
Variables
Dependent: the variable being measured, studied, and predicted.
Independent: variables that can be manipulated or are predictors of the dependent variable.
Extraneous: variables other than the independent variables that can influence the dependent variable.
Data analysis
Descriptive statistics: Mean, Median, Mode, Standard Deviation
Correlational statistics: Correlation
Inferential statistics: T-test, Regression, Chi-square, ANOVA
Correlational Statistics
Correlation establishes that two measures have a relationship.
Indicates direction & strength, but not causation!
Allows researcher to consider other statistical tests with confidence.
Requirements: random sample, interval or ratio data, normal distribution, linear relationship
Direction
Positive: As one value increases, the other does as well.
Example: Age and height. Library: Enrollment & materials budget.
Negative: As one value increases, the other decreases.
Example: Car speed & time to destination. Library: Items purchased & shelf space.
Strength
Value between 1 (positive) and -1 (negative). The closer to those values, the stronger the relationship; a value near 0 indicates little or no linear relationship.
Correlation
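As a rough illustration of direction and strength, the Pearson correlation coefficient can be computed by hand. The function name and all of the data below are hypothetical, invented only to show one positive and one negative relationship.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: enrollment (thousands) and materials budget ($ millions)
enrollment = [5, 10, 15, 20, 25]
budget = [1.0, 2.1, 2.9, 4.2, 5.0]

# Hypothetical data: car speed (mph) and time to destination (minutes)
speed = [30, 40, 50, 60]
travel_time = [120, 90, 72, 60]

r_positive = pearson_r(enrollment, budget)   # close to +1
r_negative = pearson_r(speed, travel_time)   # close to -1
print(round(r_positive, 3), round(r_negative, 3))
```

Note that the speed and time relationship is not perfectly linear, so its coefficient is strong but not exactly -1.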
Inferential Statistics
Parametric: assume that the dependent variable has a known underlying mathematical distribution (normal, binomial, Poisson, etc.) which serves as the basis for sample-to-population estimates. Parametric tests are robust and have great power efficiency.
Non-parametric: do not assume a normal distribution (distribution free) & require that the data meet fewer assumptions. Allow for the analysis of a mixture of data types.
T-Test
Determine if there is a difference (in a characteristic) between two populations based on data from samples of those populations.
Requirements: random sample, interval or ratio data, normal distribution, equal standard deviations
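A minimal sketch of an unpaired two-sample t-test, assuming equal standard deviations as the requirements state (the pooled-variance form). The scores are hypothetical, the function name is an assumption, and the critical value of 2.306 is the standard two-tailed table value for 8 degrees of freedom at alpha = 0.05.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_error, df


# Hypothetical test scores for two groups of students
group_1 = [70, 72, 74, 76, 78]
group_2 = [77, 79, 81, 83, 85]

t_stat, df = pooled_t(group_1, group_2)

CRITICAL_T = 2.306  # two-tailed, alpha = 0.05, df = 8 (from a t table)
reject_null = abs(t_stat) > CRITICAL_T
print(t_stat, df, reject_null)
```

Here the group means differ by 7 points with a standard error of 2, so the t statistic exceeds the critical value and the null hypothesis of no difference would be rejected.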
Regression
Predicts values of a dependent variable based on values of independent (predictor) variables
Requirements: interval or ratio data, normal distribution, correlated variables, linear relationship
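To make the idea of prediction concrete, here is a small sketch of ordinary least squares regression with a single predictor. The data and names are hypothetical; real analyses with several predictors would normally use a statistics package.

```python
def ols_fit(x, y):
    """Least-squares slope and intercept for one predictor variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: independent variable x, dependent variable y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = ols_fit(x, y)

# Predict the dependent variable for a new value of the predictor
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```

For this data the fitted line is roughly y = 2.2 + 0.6x, so the prediction at x = 6 is about 5.8.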
ANOVA
Determine if there are differences between three or more sample means.
Test the significance and direction of the difference.
Requirements: normal distribution (in each cell), interval or ratio data, homogeneity of variance
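The one-way ANOVA F statistic compares variation between group means with variation within groups. The sketch below uses hypothetical data for three groups; the function name is an assumption, and 5.14 is the standard table value of F for (2, 6) degrees of freedom at alpha = 0.05.

```python
def one_way_f(groups):
    """One-way ANOVA; returns (F, df_between, df_within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical weekly library visits for three patron groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat, df_between, df_within = one_way_f(groups)

CRITICAL_F = 5.14  # alpha = 0.05, df = (2, 6), from an F table
reject_null = f_stat > CRITICAL_F
print(f_stat, df_between, df_within, reject_null)
```

A significant F only says that at least one group mean differs; identifying which groups differ takes a follow-up comparison.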
Chi Square Test
Difference between expected and observed frequencies for nominal or ordinal data
Requirements: any type of data, large sample size (>50), similar distributions
Pepsi Challenge
Observed: Pepsi 85, Coke 57, RC 78
Expected (equal) = 73.33
          O       E      O-E   (O-E)²   (O-E)²/E
Pepsi    85   73.33    11.67   136.19       1.86
Coke     57   73.33   -16.33   266.67       3.64
RC       78   73.33     4.67    21.81       0.30
Totals  220  219.99                     χ² = 5.8
Degrees of freedom = rows - 1 = 3 - 1 = 2
Critical value of χ² = 5.99 at alpha = 0.05
Observed value of χ² = 5.8
Decision: Fail to reject H0
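The Pepsi Challenge arithmetic can be reproduced with a short script. The observed counts and the critical value of 5.99 come from the slide; the variable names are assumptions made for this sketch.

```python
# Observed taste-test preferences from the slide
observed = {"Pepsi": 85, "Coke": 57, "RC": 78}

total = sum(observed.values())
# Null hypothesis: all three colas are equally preferred
expected = total / len(observed)

# Sum of (O - E)^2 / E over all categories
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

df = len(observed) - 1
CRITICAL_VALUE = 5.99  # chi-square critical value at alpha = 0.05, df = 2

reject_null = chi_square > CRITICAL_VALUE
print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_null}")
```

The computed value of about 5.79 falls just below the critical value, which matches the slide's decision to fail to reject H0.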
Inferential Statistics
Goal | Parametric | Non-parametric
Compare sample to a hypothetical value | T-test | Wilcoxon test, Chi-square
Quantify association between variables | Pearson correlation | Spearman correlation, Kendall's tau
Predict value from measured variables | OLS regression | Poisson regression, Negative binomial regression
Compare 2 unpaired groups | Unpaired t-test | Mann-Whitney test, Fisher's test
Compare 2 paired groups | Paired t-test | Wilcoxon signed-rank test, Sign test
Compare 3+ unmatched groups | ANOVA | Kruskal-Wallis test, Chi-square test
Review: Research Design
Research Question: What will the study answer?
Hypotheses: What do you think the results will be?
Data definitions: What scales are the variables, what is the distribution, and what are the dependent, independent & extraneous variables?
Data collection: What is the best method for collecting the variables of interest?
Data analysis: What are the proper statistical tests to use on the data?
Conclusions: What do the data show us or indicate?
Case Studies
Citation Analysis: Antelman, K. (2004). “Do Open-Access Articles Have a Greater Research Impact?” College & Research Libraries, 65(5): 372-382.
Usage Analysis: Blecic, D.D. (1999). “Measurements of journal use: an analysis of the correlations between three methods.” Bull Med Libr Assoc, 87(1): 20-25.
Service Analysis: Nichols, J.; Shaffer, B.; Shockey, K. (2003). “Changing the Face of Instruction: Is Online or In-class More Effective?” College & Research Libraries, 64(5): 378-389.
“Changing the Face of Instruction…”
Research Question
Is an online tutorial as effective in teaching library instruction as a classroom setting?
Hypotheses
H1. Students will have higher scores in information literacy tests after library instruction.
H2. Students will have the same or higher scores in info-lit tests after taking online tutorials as students taking traditional instruction.
H3. Students will report as much or more satisfaction with online instruction as students taking traditional instruction.
Variables & Data Collection
Variables: Test scores & survey results
Data Collection: Pretest/Posttest & Survey
Statistical Tests
Descriptive statistics incl. mean, standard deviation, standard error; T-tests (1 & 2 tailed)
Conclusions
Accept H1: Instruction improves literacy.
Accept H2 alternative hypothesis: Online has no significant difference from traditional.
Accept H3 alternative hypothesis: Student satisfaction is equal with both methods.
“Do Open-Access Articles…”
Research Question
Hypothesis
Variables and Data Collection
Statistical Tests
Conclusions
Critical Questions
“Do Open-Access Articles…”
Research Question
Do freely available articles have a greater research impact?
Measures
Research impact: citation rates
Open Access: freely available
Hypotheses
H1. Scholarly articles have a greater research impact if the articles are freely available online than if they are not.
H0 (null hypothesis): There is no difference between the mean citation rates: d1 = d0
Variables & Data Collection
Variables: Mean citation rates
Data Collection: At least 50 articles from 10 leading journals in 4 disciplines.
Statistical Tests
Descriptive statistics incl. mean, standard deviation, standard error; Wilcoxon signed-rank test
Conclusions
Reject H0: Open Access articles are cited more than those that are not OA.
Critical Questions
Validity? Reliability of measures? Generalizability? Alternate hypotheses?
Discussion
My favorite statistic…
Baseball is 90% mental – the other half is physical.