View
11
Download
0
Category
Preview:
Citation preview
R: statistics for all of usR: an international statistical collaboratory
Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005
William Revelle, Northwestern Universitypersonality-project.org/r/
R: statistics for all of us
• What is it?
• Why use it?
• Common (mis)perceptions
• Examples for personality and individual differences research
R: What is it?
•R: An international collaboration
•R: the open source - public domain version of S+
•R: written by statisticians (and all of us) for statisticians (and the rest of us)
•R: an extensible language
Common statistical programs
General SpecializedR AMOSS+ EQSSAS LISRELSPSS Plus
STATA MxSYSTAT your favorite program.
Common statistical programsmost are costly
General SpecializedR AMO$$+ EQ$
$A$ LI$REL$P$$ Maple$
$TATA Mx$Y$TAT your favorite program.
R: a way of thinking(from the R point of view)
• “R is the lingua franca of statistical research. Work in all other languages should be discouraged.”
• “This is R. There is no if. Only how.”
• “Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities (last year it was about 10 years behind) in my estimation.”
Taken from the R.-fortunes (selections from the R.-help list serve)
But it is open source - how can you trust it?
• Q: When you use it [R], since it is written by so many authors, how do you know that the results are trustable?
• A:The R engine [...] is pretty well uniformly excellent code but you have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it.
Taken from the R.-fortunes (selections from the R.-help list serve)
What is R? : Technically
• R is an open source implementation of S (S-Plus is a commercial implementation)
• R is available under GNU Copy-left
• The current version of R is 2.1.1
• R is group project run by a core group of developers (with new releases semiannually)
• (Adapted from Robert Gentleman)
R: History• 1991-93: Ross Dhaka and Robert Gentleman begin
work on R project at U. Auckland
• 1995: R available by ftp under the GPL
• 96-97: mailing list and R core group is formed
• 2000: John Chambers, designer of S joins the R core (wins a prize for best software from ACM for S)
• 2001-2005: Core team continues to improve base package
• Many (>400) others contribute “packages”
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate generalizations of the linear model
• Multivariate-structural extensions
Why R? Graphics
• Sample graphics taken from
• http:personality-project.org/r/
• showing what can be done by an amateur
• http://addictedtor.free.fr/graphiques/
• showing some most impressive graphs
m-1.0-1.0M-1.0M
-1.0
-0.5-0.5M-0.5M
-0.5
0.00.0M0.0M
0.0
0.50.5M0.5M
0.5
1.01.0M1.0M
1.0
m-1.0-1.0M-1.0M-1
.0-0.5-0.5M-0.5M
-0.5
0.00.0M0.0M0.0
0.50.5M0.5M0.5
1.01.0M1.0M1.0
factor 1Mfactor 1M
factor 1
factor 2Mfactor 2M
facto
r 2
Two dimensions of affectMTwo dimensions of affectM
Two dimensions of affect
sluggishMsluggishM
sluggish
depressedMdepressedM
depressed
satisfiedMsatisfiedM
satisfied
relaxedMrelaxedM
relaxed
blueMblueM
blue
intenseMintenseM
intense
scaredMscaredM
scared
enthusiasticMenthusiasticM
enthusiastic
sadMsadM
sad
full_of_pepMfull_of_pepM
full_of_pep
unhappyMunhappyM
unhappy
livelyMlivelyM
lively
tiredMtiredM
tired
dullMdullM
dull
at_easeMat_easeM
at_ease
M M
guiltyMguiltyM
guilty
excitedMexcitedM
excited
gloomyMgloomyM
gloomy
nervousMnervousM
nervous
grouchyMgrouchyM
grouchy
calmMcalmM
calm
drowsyMdrowsyM
drowsy
alertMalertM
alert
surprisedMsurprisedM
surprised
energeticMenergeticM
energetic
sorryMsorryM
sorry
sleepyMsleepyM
sleepy
pleasedMpleasedM
pleased
happyMhappyM
happy
distressedMdistressedM
distressed
contentMcontentM
content
tenseMtenseM
tense
ashamedMashamedM
ashamed
anxiousManxiousM
anxious
cheerfulMcheerfulM
cheerful
Standard Plots of factor loadings
m-0.4-0.4M-0.4M
-0.4
-0.2-0.2M-0.2M
-0.2
0.00.0M0.0M
0.0
0.20.2M0.2M
0.2
0.40.4M0.4M
0.4
0.60.6M0.6M
0.6
0.80.8M0.8M
0.8
1.01.0M1.0M
1.0
m-0.4-0.4M-0.4M-0
.4-0.2-0.2M-0.2M
-0.2
0.00.0M0.0M0
.00.20.2M0.2M
0.2
0.40.4M0.4M0
.40.60.6M0.6M
0.6
0.80.8M0.8M0
.81.01.0M1.0M
1.0
correlations with happyMcorrelations with happyM
correlations with happy
correlations with sadMcorrelations with sadM
co
rre
latio
ns w
ith
sa
d
correlations with happy and sadMcorrelations with happy and sadM
correlations with happy and sad
sluggishMsluggishM
sluggish
depressedMdepressedM
depressed
satisfiedMsatisfiedM
satisfied
relaxedMrelaxedM
relaxed
blueMblueM
blue
intenseMintenseM
intense
scaredMscaredM
scared
sadMsadM
sad
full_of_pepMfull_of_pepM
full_of_pep
unhappyMunhappyM
unhappy
livelyMlivelyM
lively
tiredMtiredM
tired
dullMdullM
dull
at_easeMat_easeM
at_ease
M M
guiltyMguiltyM
guilty
excitedMexcitedM
excited
gloomyMgloomyM
gloomy
nervousMnervousM
nervous
grouchyMgrouchyM
grouchy
calmMcalmM
calm
drowsyMdrowsyM
drowsy
alertMalertM
alert
surprisedMsurprisedM
surprised
energeticMenergeticM
energetic
sorryMsorryM
sorry
sleepyMsleepyM
sleepy
pleasedMpleasedM
pleased
happyMhappyM
happy
distressedMdistressedM
distressed
contentMcontentM
content
tenseMtenseM
tense
ashamedMashamedM
ashamed
anxiousManxiousM
anxious
cheerfulMcheerfulM
cheerful
Data points can be dynamically identified
-1.0
/Users/Bill/Users/Bill
-0.5
/Users/Bill
0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with 0 degrees
r w
ith 9
0 d
egre
es
Simulated correlations 90 degrees
/Users/Bill
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with happy
r w
ith n
erv
ous
Observed correlations with happy and nervous
/Users/Bill
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with 0 degrees
r w
ith 1
20 d
egre
es
Simulated correlations 120 degrees
/Users/Bill
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with happy
r w
ith s
ad
Observed correlations with happy and sad
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with 0 degrees
r w
ith 1
80 d
egre
es
/Users/Bill
Simulated correlations 180 degrees
/Users/Bill
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
0.5
1.0
r with active
r w
ith inactive
Observed correlations with active and inactive
Multi-panel graphs can be labeled separately and organized vertically or
horizontally
Simulated data can be generated to fit normal,
rectangular, binomial, poissen, exponential, etc.
distributions
epiE
0 2 4 6 8 10 12
0.85 0.80
0 1 2 3 4 5 6 7
-0.22
510
15
20
-0.180
24
68
10
12
~
epiS
0.43 -0.05 -0.22
~
epiImp
-0.24
02
46
8
-0.07
01
23
45
67
~
epilie
-0.25
5 10 15 20 0
~
2 4 6 8
~
0 5 10 15 20
05
10
15
20epiNeur
Scatter Plot Matrices can show smoothed fits
epiE
0 2 4 6 8 10 12
0.85 0.80
~
0 1 2 3 4 5 6 7
-0.22
510
15
20
-0.18
02
46
810
12
epiS
0.43 -0.052 -0.22
~
epiImp
-0.24
02
46
8
-0.074
01
23
45
67
~
epilie
-0.25
5 10 15 20 0 2 4 6 8
~
0 5 10 15 20
05
10
15
20epiNeur
Can scale font size of correlations by absolute size of r
●
−1.0 −0.5 0.0 0.5 1.0 1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
Movie Study 1
Positive Affect
Nega
tive
Affe
ct
Sad
Threat Neutral
Happy
●
−1.0 −0.5 0.0 0.5 1.0 1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
Movie Study 2
Positive Affect
Nega
tive
Affe
ct
Sad
Threat
Neutral
Happy
Error bars on two dimensions
●
−1.0 −0.5 0.0 0.5 1.0 1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
PA x NA
Positive Affect
Nega
tive
Affe
ct Sad
ThreatNeutral
Happy
●
−1.0 −0.5 0.0 0.5 1.0 1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
PA x NA
Positive Affect
Nega
tive
Affe
ct
Sad
Threat
Neutral
Happy
●
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
−1.5
−0.5
0.5
1.0
1.5
PA x EA
Postive Affect
Ener
getic
Aro
usal
Sad
ThreatNeutral
Happy
●
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
−1.5
−0.5
0.5
1.0
1.5
PA x EA
Postive Affect
Ener
getic
Aro
usal
Sad
Threat
Neutral
Happy
●
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
−1.5
−0.5
0.5
1.0
1.5
NA x TA
Negative Affect
Tens
e Ar
ousa
l SadThreat
NeutralHappy
●
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
−1.5
−0.5
0.5
1.0
1.5
NA x TA
Negative Affect
Tens
e Ar
ousa
l SadThreat
Neutral
Happy
Study 2 Study 3
Presentation graphicsscale to fit page
and produce output for pdf presentations
Window or page size is controllable
Graphic output can be taken from multiple programs (e.g. Very Simple Structure, factor analysis
5 10 15
46
81
01
2
x1
y1
5 10 15
46
81
01
2
x2
y2
5 10 15
46
81
01
2
x3
y3
5 10 15
46
81
01
2
~
x4
y4
Anscombe's 4 Regression data sets
Built-in data sets provide useful demonstrations of stats
Mapping data (GIS) may be combined with descriptive data
Many GIS map files are available for download from ESRI
GIS maps detail regional boundaries
US counties
Counties of California
~
Median Household Income for 2000
2000 median household income for zip codes
~
Median Household Income for 2000
Combine income data from 2000 census with zipcode map of San Diego County
100 105 110 115 120 125 130
-50
510
Borneo and its surrounds
longitude
latitude
GIS files of Borneo can show country boundaries (e.g., Malaysia, Brunei, Indonesia)
100 105 110 115 120 125 130
-50
510
Borneo and its surrounds
longitude
latitude
Public access GIS files include rivers and roads
Even more graphics
• Taken from a collection of R demonstrations and graphics
• http://addictedtor.free.fr/graphiques/
c(-1, 1)
~
PCA 5 varsprincomp(x = data, cor = cor)
x1
Fertility
Agriculture
Examination
Education Catholic
(1-3) 60%
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
0.0
0.5
1.0
1.5
c(-1, 1)
Clustering 4 groups
80 60 40 20 0
48Groups
2
1
16
28
~
V. De Geneve
c(-1, 1)
c(0
, 1)
Factor 1 [41%]
c(-1, 1)
c(0
, 1)
Factor 3 [19%]
Principal components and clustering of sources of variance in USA arrest data
~
-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0
0.0
0.2
0.4
0.6
Mixture models
1
~
2 3 4 5 6 7 8 9 10
0
2
4
6
Notched Boxplots
Notched Boxplots show confidence regions
~
Scatter Plot Matrix
Sepal
Length
Sepal
Width
~
Petal
Length
setosa
Sepal
Length
Sepal
Width
~
Petal
Length
versicolor
Sepal
Length
~
Sepal
Width
Petal
Length
virginica
Three
Varieties
of
Iris
Multipanel graphs
~
Height (inches)
De
nsity
60 65 70 75
0.00
0.05
0.10
0.15
0.20
~
Bass 2 Bass 1
60 65 70 75
Tenor 2
~
Tenor 1
Alto 2
60 65 70 75
Alto 1
~
Soprano 2
60 65 70 75
0.00
0.05
0.10
0.15
0.20
Soprano 1
Histograms and fitted distributions
-3 -2 -1 0 1 2 3
-3
-2
-1
0
1
2
3
Combine scatter plot with histograms
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate generalizations of the linear model
• Multivariate-structural extensions
Data ManipulationData Entry
• from console
• from clipboard (copied from other programs)
• from file (text files, csv, SPSS, Excel, MySQL)
• from the web
Data Manipulation: Data Structures
• Data types: integer, real, logical, character, string
• Vectors of any data type
• Matrices of any data type
• Data Frames (similar to matrix of mixed type)
• Lists of any mixture of types
• All operations are functions and the returned values may be used in any data structure (e.g., as an element of a data frame or of a list)
Data Manipulation
• standard arithmetic and logical operations
• matrix operations including transpose, inner product, outer product, diagonal, trace, invert
• searching, sorting, merging
• data cleaning by logical commands
Why R?
• Graphics for data exploration and interpretation
• Data manipulation including statistics as data
• Statistical analysis
• Standard univariate and multivariate generalizations of the linear model
• Multivariate-structural extensions
Standard Statistical packages in R
• Descriptive and exploratory statistics
• The general linear model and its special cases
• t-test, ANOVA, MANOVA, regression, logistic regression, cox models, etc.
• multilevel models (mixed models, hierarchical models)
• time series, econometrics
• circular statistics, environmental-geographical statistics
Psychometric packages• Structural Equation Modeling
• Rasch Modeling of one parameter IRT
• Factor and Principal Component Analysis
• Rotations (Orthogonal: Varimax, Oblique: quartimax, quartimin, Promax, etc.)
• singular value decomposition
• eigen vector - eigen value decomposition
• Multidimensional scaling
R is extensible
• Functions can be defined easily and then stored for later use.
• Packages of functions are written by “all of us” to solve particular problems
• e.g. sem, rasch, rotation, lattice graphics
• Packages can be developed and tailored for a specific lab or problem area, or can be shared through the CRAN repository of (currently) > 400 packages
Programming in R-- personal examples
• Very Simple Structure
• 6 weeks in Fortran for a mainframe
• 3 weeks in Pascal for a Mac
• 2 days in R for Mac/Unix/Windows
• Schmid Leiman decomposition and estimation of coefficient omega
• 1 day
Popular misconceptions
• R is hard to learn
• R is not user friendly
• I can’t figure R out
• R is free -- it can’t be very good
Popular Misconceptions• R is hard to learn
• With a brief web based tutorial, 2nd and 3rd year undergraduates in psychological methods and personality research courses were using R for descriptive and inferential statistics and producing publication quality graphics
• R is easy to learn, hard to master
• R-help newsgroup is very supportive
• Multiple web based and pdf tutorials
Popular Misconceptions• R is not user friendly
• Does not have a GUI, but those are being developed
• Help and examples are embedded within program,
• ? function produces help with examples for any function
• R-help newsgroup is very supportive
Popular Misconceptions
• R is free, it can not be any good
• Developed by some of the best statisticians around
• vetted by all users, bugs when they exist, are rapidly fixed
• R community provides excellent and timely statistical and programming help (if somewhat acerbic to those who have not done their homework)
R: an international collaboratory
• Most appropriate for an International Society to be aware of the power of international collaboration to produce cutting edge software
R: statistics for all of usR: an international statistical collaboratory
Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005
William Revelle, Northwestern Universitypersonality-project.org/r/personality-project.org/r.short.html
Recommended