Summarizing Data
Numeric Methods
Rcmdr
• Features for loading, viewing and analyzing data
• Help system
• Packages
Data in R
• Several formats: vectors, arrays, matrices, lists, data.frames
• Generally we use data.frames, as they let us store different kinds of data and link them by row.
• Rcmdr uses data.frames
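A minimal sketch of what "different kinds of data linked by row" looks like in practice. The six rows are copied from the DartPoints listings in these slides; the object name `pts` is made up for this illustration and is not part of the original data set.

```r
# Hypothetical small data frame; values copied from the DartPoints listings
pts <- data.frame(
  Name   = factor(c("Darl", "Darl", "Darl",
                    "Pedernales", "Pedernales", "Pedernales")),
  Length = c(34.5, 36.0, 32.4, 74.0, 64.5, 78.3),
  Width  = c(15.9, 17.1, 14.5, 34.0, 28.5, 28.1),
  Thick  = c(4.8, 4.0, 5.2, 6.6, 8.2, 8.5),
  row.names = c("35-3026", "36-3321", "36-3520",
                "38-0098", "35-2951", "35-0173")
)

sapply(pts, class)   # one factor column, three numeric columns
pts["38-0098", ]     # every measurement for one specimen stays on one row
```

A matrix could not hold the factor and the numeric columns together without converting them to a single type, which is the main reason a data frame is the usual choice here.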
Referencing Data.frames
• R allows you to refer to rows, columns and individual cells in a data frame in multiple ways
• Every cell has a row and column number that identifies it (like Excel)
• Every cell is the intersection of a named row and a named column
Types of Data in R
• Numeric – integer and decimal
• Categorical – factors
• Ranks – ordered factor
• Logical (True/False)
• Character
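The types listed above can be sketched with a few short vectors. The `size` and `complete` values are invented for this illustration; only the type names and site codes come from the slides.

```r
len      <- c(34.5, 36.0, 32.4)                     # numeric (decimal)
count    <- c(4L, 3L, 3L)                           # numeric (integer)
type     <- factor(c("Darl", "Pedernales", "Darl")) # categorical: factor
size     <- factor(c("small", "large", "medium"),   # ranks: ordered factor
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)                  # (hypothetical categories)
complete <- c(TRUE, FALSE, TRUE)                    # logical (hypothetical)
site     <- c("41CV0270", "41CV1023", "41CV0495")   # character

class(len); class(count); class(type)
class(size); class(complete); class(site)

# Order comparisons only make sense for ordered factors
size < "large"
```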
Data sets
• Darl and Pedernales points from Fort Hood archaeological surveys
• Data on village and house sizes among California Indian tribes from an article by Sherburne Cook and Robert Heizer
[Figure: Darl and Pedernales points]
> head(DartPoints)
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
36-3321 Darl 41CV1023 12/58   58    12   36.0  17.1   4.0
36-3520 Darl 41CV0495 16/62   62    16   32.4  14.5   5.2
35-2382 Darl 41CV0611 22/62   62    22   31.2  15.6   5.1
40-0847 Darl 41CV1287 05/48   48     5   33.6  15.8   5.1
35-2959 Darl 41CV0235 21/63   63    21   41.8  16.8   4.1
> tail(DartPoints)
              Name     TARL  QUAD East North Length Width Thick
38-0098 Pedernales 41BL0416 39/45   45    39   74.0  34.0   6.6
35-2951 Pedernales 41CV0235 21/63   63    21   64.5  28.5   8.2
35-0173 Pedernales 41CV0869 22/66   66    22   78.3  28.1   8.5
36-4266 Pedernales 41CV0240 15/65   65    15   64.1  27.2  12.0
41-0239 Pedernales 41CV0493 16/62   62    16   67.2  27.1  12.0
35-2855 Pedernales 41CV0843 24/65   65    24   49.3  19.5   7.5
> str(DartPoints)
'data.frame': 55 obs. of 8 variables:
 $ Name  : Factor w/ 2 levels "Darl","Pedernales": 1 1 1 1 1 1 1 1 1 1 ...
 $ TARL  : Factor w/ 43 levels "41BL0183","41BL0205",..: 13 34 18 21 ...
 $ QUAD  : Factor w/ 38 levels "05/48","08/63",..: 30 9 17 28 1 26 15 ...
 $ East  : num 62 58 62 62 48 63 33 63 59 63 ...
 $ North : num 24 12 16 22 5 21 16 17 26 20 ...
 $ Length: num 34.5 36 32.4 31.2 33.6 41.8 33.5 32 42.8 37.5 ...
 $ Width : num 15.9 17.1 14.5 15.6 15.8 16.8 16.6 16 15.8 16.3 ...
 $ Thick : num 4.8 4 5.2 5.1 5.1 4.1 4.9 5.4 5.8 6.1 ...
 - attr(*, "na.action")=Class 'omit' Named int [1:2] 39 53
  .. ..- attr(*, "names")= chr [1:2] "35-2650" "35-2384"
> attributes(DartPoints)
$names
[1] "Name"   "TARL"   "QUAD"   "East"   "North"  "Length" "Width"  "Thick"

$row.names
 [1] "35-3026"  "36-3321"  "36-3520"  "35-2382"  "40-0847"  "35-2959"
 [7] "41-0257"  "36-3619"  "41-0322"  "35-2921"  "36-3036"  "35-2905"
[13] "35-2866"  "36-3487"  "36-4247"  "35-2928"  "35-2871"  "36-3898"
[19] "35-2946"  "38-0736"  "35-2325"  "35-0164"  "41-0323"  "35-3043"
[25] "35-2004"  "35-2960"  "41-0237"  "44-0643"  "43-0110"  "36-3549"
[31] "41-0008"  "36-4320"  "44-1315M" "35-2901"  "41-0220"  "35-2873"
[37] "47-0041"  "36-3879"  "41-0054"  "50-0092"  "44-1492M" "36-3880"
[43] "35-2875"  "36-3081"  "36-3897"  "44-1253M" "36-3229"  "41-0058"
[49] "35-2391"  "38-0098"  "35-2951"  "35-0173"  "36-4266"  "41-0239"
[55] "35-2855"

$class
[1] "data.frame"
> DartPoints[1,]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints["35-3026",]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints[,6]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[,"Length"]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints$Length
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[1,6]
[1] 34.5
> head(CAIndians)
  Region   Tribe   Language AreaHouse FamilySize FpHouse PpHouse AreapPer
1      1   Yurok   Algonkin       439        7.5       1     7.5     58.5
2      2   Wiyot   Algonkin       254        7.5       1     7.5     33.8
3      3   Karok      Hokan        NA        7.5       1     7.5       NA
4      4    Hupa Athabaskan       400        7.0       1     7.0     57.1
5      5 Chilula Athabaskan        NA        7.5       1     7.5       NA
6      6  Shasta      Hokan       264        7.0       1     7.0     33.0
  HpVillage PpVillage AreapVil AreaVillage VpHouse VpPer PctFloor
1       7.8        60     3434       25450    3263   424     13.5
2       7.6        57     1930       28400    3738   498      6.8
3       4.1        31       NA          NA      NA    NA       NA
4      10.9        76     4360          NA      NA    NA       NA
5       7.0        52       NA          NA      NA    NA       NA
6       6.0        48     1584       18950    3158   394      8.4
> str(CAIndians)
'data.frame': 30 obs. of 15 variables:
 $ Region     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Tribe      : Factor w/ 30 levels "Achomawi","Athabascans",..: 30 27 8 7 5...
 $ Language   : Factor w/ 6 levels "Algonkin","Athabaskan",..: 1 1 3 2 2 3 3 ...
 $ AreaHouse  : int 439 254 NA 400 NA 264 110 118 100 125 ...
 $ FamilySize : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ FpHouse    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ PpHouse    : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ AreapPer   : num 58.5 33.8 NA 57.1 NA 33 18.3 19.6 16.7 20.8 ...
 $ HpVillage  : num 7.8 7.6 4.1 10.9 7 6 5.3 5.4 3.6 5 ...
 $ PpVillage  : int 60 57 31 76 52 48 32 32 22 30 ...
 $ AreapVil   : int 3434 1930 NA 4360 NA 1584 583 637 360 625 ...
 $ AreaVillage: int 25450 28400 NA NA NA 18950 14000 27100 61500 6390 ...
 $ VpHouse    : int 3263 3738 NA NA NA 3158 2641 5019 17084 1278 ...
 $ VpPer      : int 424 498 NA NA NA 394 438 847 2795 214 ...
 $ PctFloor   : num 13.5 6.8 NA NA NA 8.4 4.2 2.4 0.6 9.8 ...
Central Tendency
• Mean (Average) = Sum/Number
  – For dichotomous (0/1) data, the mean is the proportion present (× 100 = percentage present)
• Median = Middle value
• Mode = Predominant value
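Two of these points can be sketched in a few lines. Base R has no function for the statistical mode (its `mode()` reports the storage mode of an object), so `stat_mode()` below is a hypothetical helper written for this example. The vectors are made-up illustrations, not values from the data sets.

```r
# Hypothetical helper: most frequent value(s) of a vector.
# Returns the value(s) as character strings; ties return all tied values.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

type <- c("Darl", "Darl", "Pedernales", "Darl", "Pedernales")
stat_mode(type)        # "Darl"

# Dichotomous data coded 0 = absent, 1 = present
present <- c(1, 0, 1, 1, 0)
mean(present)          # proportion present: 0.6
mean(present) * 100    # percentage present: 60
```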
> mean(DartPoints$Length)
[1] 48.64
> median(DartPoints$Length)
[1] 47.1
> mean(CAIndians$AreaHouse)
[1] NA
> mean(CAIndians$AreaHouse, na.rm=TRUE)
[1] 299.4815
> median(CAIndians$AreaHouse)
[1] NA
> median(CAIndians$AreaHouse, na.rm=TRUE)
[1] 129
> # colMeans() is used for several columns at once; mean() on a data frame is defunct since R 3.0.0
> colMeans(DartPoints[,6:8])
   Length     Width     Thick 
48.640000 22.052727  7.283636 
> colMeans(DartPoints[DartPoints$Name=="Darl",6:8])
   Length     Width     Thick 
40.574074 18.003704  5.981481 
> colMeans(DartPoints[DartPoints$Name=="Pedernales",6:8])
   Length     Width     Thick 
56.417857 25.957143  8.539286 
Dispersion
• Range (max – min)
• Standard Deviation, Variance (Sample vs. Population)
• Coefficient of Variation = StDev/Mean * 100
• Quartiles and the Interquartile Range
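The sample vs. population distinction can be sketched directly: R's `var()` and `sd()` divide by n − 1 (the sample versions), so a population value has to be computed by hand. The six lengths are taken from the head(DartPoints) listing; the variable names are made up for this example.

```r
len <- c(34.5, 36.0, 32.4, 31.2, 33.6, 41.8)  # first six Length values
n   <- length(len)
ss  <- sum((len - mean(len))^2)               # sum of squared deviations

sample_var <- ss / (n - 1)    # what var() reports
pop_var    <- ss / n          # population version, always a little smaller
all.equal(sample_var, var(len))

sqrt(sample_var)              # same as sd(len)
sqrt(pop_var)                 # population standard deviation

diff(range(len))              # range as a single number (max - min)
sd(len) / mean(len) * 100     # coefficient of variation, in percent

q <- quantile(len, c(0.25, 0.75))
unname(diff(q))               # interquartile range, same as IQR(len)
```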
> range(DartPoints$Length)
[1] 31.2 84.0
> diff(range(DartPoints$Length))
[1] 52.8
> sd(DartPoints$Length)
[1] 12.22144
> var(DartPoints$Length)
[1] 149.3636
> sd(DartPoints$Length)/mean(DartPoints$Length)*100
[1] 25.12631
> quantile(DartPoints$Length)
   0%   25%   50%   75%  100% 
31.20 40.90 47.10 55.65 84.00 
> IQR(DartPoints$Length)
[1] 14.75
> diff(range(CAIndians$AreaHouse, na.rm=TRUE))
[1] 1175
> sd(CAIndians$AreaHouse, na.rm=TRUE)
[1] 339.4273
> var(CAIndians$AreaHouse, na.rm=TRUE)
[1] 115210.9
> quantile(CAIndians$AreaHouse, na.rm=TRUE)
    0%    25%    50%    75%   100% 
  75.0  110.5  129.0  310.0 1250.0 
Shape
• Symmetry, Skewness
  – Normal = 0, Positive or Negative indicates tail in that direction
• Peaked vs Flat, Kurtosis
  – Normal = 0, Positive – more clustered (peaked) than normal, Negative – more spread (flatter) than normal
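A rough sketch of what these two numbers measure, using the plain moment-based definitions (third and fourth central moments scaled by the variance, with 3 subtracted so that a normal distribution gives kurtosis 0). The function names are hypothetical. Packages apply slightly different small-sample corrections, so their results need not match these to the last digit.

```r
# Hypothetical helpers using simple moment formulas
skew_g1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurt_g2 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # excess kurtosis: normal = 0
}

sym   <- c(1, 2, 3, 4, 5)        # symmetric, evenly spread
right <- c(1, 1, 2, 2, 3, 10)    # long tail toward large values

skew_g1(sym)     # 0: no tail in either direction
skew_g1(right)   # positive: tail to the right
kurt_g2(sym)     # negative: flatter than normal
```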
> library(e1071)
Loading required package: class
> skewness(DartPoints$Length)
[1] 0.7749526
> kurtosis(DartPoints$Length)
[1] 0.12126
> skewness(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.708035
> kurtosis(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.498035
Descriptive Stats
• summary() – in base R
• numSummary() – in Rcmdr
• describe() – in psych
• describe() – in prettyR
• stat.desc() – in pastecs
> summary(DartPoints)
         Name          TARL         QUAD         East           North      
 Darl      :27   41CV0235: 4   21/63  : 4   Min.   :33.00   Min.   : 5.00  
 Pedernales:28   41CV0859: 3   14/62  : 3   1st Qu.:55.00   1st Qu.:14.50  
                 41CV1092: 3   16/62  : 3   Median :62.00   Median :20.00  
                 41BL0205: 2   20/63  : 3   Mean   :58.24   Mean   :19.02  
                 41CV0132: 2   22/66  : 3   3rd Qu.:63.50   3rd Qu.:23.00  
                 41CV0493: 2   24/66  : 3   Max.   :70.00   Max.   :39.00  
                 (Other) :39   (Other):36                                  
     Length          Width           Thick       
 Min.   :31.20   Min.   :14.50   Min.   : 4.000  
 1st Qu.:40.90   1st Qu.:16.95   1st Qu.: 5.850  
 Median :47.10   Median :22.00   Median : 7.200  
 Mean   :48.64   Mean   :22.05   Mean   : 7.284  
 3rd Qu.:55.65   3rd Qu.:26.95   3rd Qu.: 8.050  
 Max.   :84.00   Max.   :34.00   Max.   :12.000  
> numSummary(DartPoints[,6:8])
            mean        sd   0%   25%  50%   75% 100%  n
Length 48.640000 12.221438 31.2 40.90 47.1 55.65   84 55
Width  22.052727  5.194579 14.5 16.95 22.0 26.95   34 55
Thick   7.283636  1.891870  4.0  5.85  7.2  8.05   12 55
> library(psych)
> describe(DartPoints[,6:8])
       var  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
Length   1 55 48.64 12.22   47.1   47.63 12.16 31.2  84  52.8 0.77     0.38 1.65
Width    2 55 22.05  5.19   22.0   21.85  7.41 14.5  34  19.5 0.24    -1.16 0.70
Thick    3 55  7.28  1.89    7.2    7.13  1.48  4.0  12   8.0 0.69     0.50 0.26
> detach(package:psych)
> library(prettyR)
> describe(DartPoints[,6:8])
Description of DartPoints[, 6:8]

 Numeric
        mean median   var    sd valid.n
Length 48.64   47.1 149.4 12.22      55
Width  22.05     22 26.98 5.195      55
Thick  7.284    7.2 3.579 1.892      55
> library(pastecs)
> stat.desc(DartPoints[,6:8])
                   Length        Width       Thick
nbr.val        55.0000000   55.0000000  55.0000000
nbr.null        0.0000000    0.0000000   0.0000000
nbr.na          0.0000000    0.0000000   0.0000000
min            31.2000000   14.5000000   4.0000000
max            84.0000000   34.0000000  12.0000000
range          52.8000000   19.5000000   8.0000000
sum          2675.2000000 1212.9000000 400.6000000
median         47.1000000   22.0000000   7.2000000
mean           48.6400000   22.0527273   7.2836364
SE.mean         1.6479384    0.7004369   0.2550997
CI.mean.0.95    3.3039176    1.4042914   0.5114441
var           149.3635556   26.9836498   3.5791717
std.dev        12.2214384    5.1945789   1.8918699
coef.var        0.2512631    0.2355527   0.2597425
Decimals
• The various summary functions provide limited ways to round or modify the output:
• Options digits= and scipen= can be set before running the summary
• Wrapping the function in round() works for some.
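A minimal sketch of the two approaches, using two of the Length values that stat.desc() reports in these slides. The vector `x` is made up for this example; `options()`, `round()` and `signif()` are base R.

```r
x <- c(skewness = 0.7749526, normtest.p = 0.01207084)

# Approach 1: round the values themselves
round(x, 3)     # fixed number of decimal places
signif(x, 3)    # fixed number of significant digits

# Approach 2: change how R prints, then restore the old settings.
# digits = significant digits shown; a large scipen discourages
# scientific notation such as 1.2e-02.
op <- options(digits = 3, scipen = 100)
print(x)
options(op)     # options() returned the previous values, so this resets them
```

Saving the result of `options()` in `op` and passing it back afterwards keeps the change from leaking into the rest of the session.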
> stat.desc(DartPoints[,6:8], norm=TRUE)
                 Length       Width      Thick
. . . . .
skewness   7.749526e-01  0.23775656 0.68608894
skew.2SE   1.204307e+00  0.36948312 1.06620943
kurtosis   1.212600e-01 -1.22700641 0.23109312
kurt.2SE   9.570531e-02 -0.96842355 0.18239189
normtest.W 9.435694e-01  0.92900685 0.94970998
normtest.p 1.207084e-02  0.00300526 0.02233218
> op <- options(digits=3, scipen=100)
> stat.desc(DartPoints[,6:8], norm=TRUE)
           Length    Width  Thick
. . . . .
skewness   0.7750  0.23776 0.6861
skew.2SE   1.2043  0.36948 1.0662
kurtosis   0.1213 -1.22701 0.2311
kurt.2SE   0.0957 -0.96842 0.1824
normtest.W 0.9436  0.92901 0.9497
normtest.p 0.0121  0.00301 0.0223
> options(op)
Publishable Tables
• Much of the focus in producing publishable tables in R is on LaTeX
• Most anthropologists are more familiar with HTML
• xtable() provides both if there is an xtable method for your output
Using xtable()
• xtable(function-output) produces a LaTeX version of the table
• print(xtable(function-output), type="html") converts to HTML
• Adding file="mytable.html" to the print() call writes the output to a file
> library(xtable)
> library(psych)
> print(xtable(describe(DartPoints[,6:8])), type="html")
<!-- html table generated in R 2.13.1 by xtable 1.5-6 package -->
<!-- Mon Sep 05 11:54:20 2011 -->
<TABLE border=1>
<TR> <TH> </TH> <TH> var </TH> <TH> n </TH> <TH> mean </TH> <TH> sd </TH> <TH> median </TH> <TH> trimmed </TH> <TH> mad </TH> <TH> min </TH> <TH> max </TH> <TH> range </TH> <TH> skew </TH> <TH> kurtosis </TH> <TH> se </TH> </TR>
 . . . . .
</TABLE>
• Use print(xtable(x), type="html"), select the HTML from <TABLE> to </TABLE>, copy, and paste it into Excel
• Or use print(xtable(x), type="html", file="filename.html") and insert the file into Excel or Word
Summary
                      Mixed Data  Customize  Groups  digits=   round()  xtable()
summary               Yes         No         by()    Yes       No       Yes
numSummary - Rcmdr    No          Yes        Yes     print()¹  Yes¹     Yes¹
describe - psych      No          Yes        Yes     print()   No       Yes
describe - prettyR    Yes         Yes        by()    print()   Yes      No
stat.desc - pastecs   No          Some       by()    No        No       Yes

¹ Extract the table part of the results:
print(numSummary(x)$table, digits=4)
round(numSummary(x)$table, 3)
print(xtable(numSummary(x)$table), type="html")
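The by() entries in the Groups column can be sketched as follows. The six rows are copied from the DartPoints listings in these slides; the object names `pts`, `res` and `m` are made up for this example. The rounding step uses split() with sapply(), an alternative chosen here because it returns a plain matrix that round() accepts directly.

```r
# Hypothetical small data frame; values copied from the DartPoints listings
pts <- data.frame(
  Name   = factor(c("Darl", "Darl", "Darl",
                    "Pedernales", "Pedernales", "Pedernales")),
  Length = c(34.5, 36.0, 32.4, 74.0, 64.5, 78.3),
  Width  = c(15.9, 17.1, 14.5, 34.0, 28.5, 28.1),
  Thick  = c(4.8, 4.0, 5.2, 6.6, 8.2, 8.5)
)

# by(data, grouping factor, function): one summary() per point type
res <- by(pts[, c("Length", "Width", "Thick")], pts$Name, summary)
res

# Group means as a matrix (rows = variables, columns = groups), rounded
m <- sapply(split(pts[, c("Length", "Width", "Thick")], pts$Name), colMeans)
round(m, 2)
```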