Summarizing Data

Numeric Methods

Rcmdr
• Features for loading, viewing and analyzing data
• Help system
• Packages

Data in R
• Several formats: vectors, arrays, matrices, lists, data.frames
• Generally we use data.frames, as they have the advantage of letting us store different kinds of data and link them by row
• Rcmdr uses data.frames
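
The formats listed above can be sketched in a few lines of base R. The object names and values here are illustrative only and are not part of the course data sets.

```r
# Illustrative objects only; names and values are made up for this sketch.
v <- c(34.5, 36.0, 32.4)                  # vector: one type, one dimension
m <- matrix(1:6, nrow = 2)                # matrix: one type, rows by columns
a <- array(1:24, dim = c(2, 3, 4))        # array: one type, any number of dimensions
l <- list(name = "Darl", lengths = v)     # list: components of any type and length
d <- data.frame(type   = c("Darl", "Darl", "Pedernales"),
                length = c(34.5, 36.0, 74.0))  # data.frame: columns of different types

dim(m)     # rows and columns of the matrix
dim(a)     # the three dimensions of the array
str(l)     # structure of the list
str(d)     # structure of the data frame; each row links one value per column
```

A data.frame is itself a list of equal-length columns, which is why it can mix types while a matrix cannot.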

Referencing Data.frames
• R allows you to refer to rows, columns and individual cells in a data frame in multiple ways
• Every cell has a row and column number that identifies it (like Excel)
• Every cell is the intersection of a named row and a named column
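
A minimal sketch of these ways of referencing, using a small data frame typed in by hand from three rows of the DartPoints listing in this deck (the object name pts is an assumption made so the example runs without the original data file):

```r
# Three DartPoints rows re-entered by hand; pts is a hypothetical name.
pts <- data.frame(
  Name   = factor(c("Darl", "Darl", "Darl")),
  Length = c(34.5, 36.0, 32.4),
  Width  = c(15.9, 17.1, 14.5),
  Thick  = c(4.8, 4.0, 5.2),
  row.names = c("35-3026", "36-3321", "36-3520")
)

# Rows: by number or by row name.
pts[1, ]
pts["35-3026", ]

# Columns: by number, by name, or with $; all three return the same vector.
identical(pts[, 2], pts[, "Length"])
identical(pts[, "Length"], pts$Length)

# A single cell: row and column together, by number or by name.
pts[1, 2]
pts["35-3026", "Length"]

# Rows can also be selected with a logical condition.
pts[pts$Length > 33, ]
```

Referring to columns by name rather than number tends to be safer, since the code keeps working if columns are added or reordered.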

Types of Data in R
• Numeric – integer and decimal
• Categorical – factors
• Ranks – ordered factor
• Logical (True/False)
• Character
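
As a rough illustration, each of these types can be created directly in base R. All values and names below are hypothetical.

```r
# Hypothetical values, used only to show how each type is stored.
length_mm <- c(34.5, 36.0, 32.4)                       # numeric (decimal)
n_flakes  <- c(4L, 3L, 2L)                             # numeric (integer)
type      <- factor(c("Darl", "Darl", "Pedernales"))   # categorical (factor)
size      <- factor(c("small", "large", "medium"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)                    # ranks (ordered factor)
complete  <- c(TRUE, FALSE, TRUE)                      # logical
site      <- c("41CV0270", "41CV1023", "41CV0495")     # character

class(length_mm)
class(n_flakes)
levels(type)
size[1] < size[2]   # comparisons are meaningful only for ordered factors
class(complete)
class(site)
```

The levels= argument sets the rank order; without it R sorts the levels alphabetically, which would put "large" before "small".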

Data sets
• Darl and Pedernales points from Fort Hood archaeological surveys
• Data on village and house sizes among California Indian tribes from an article by Sherburne Cook and Robert Heizer

[Figure: Darl and Pedernales points]

> head(DartPoints)
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
36-3321 Darl 41CV1023 12/58   58    12   36.0  17.1   4.0
36-3520 Darl 41CV0495 16/62   62    16   32.4  14.5   5.2
35-2382 Darl 41CV0611 22/62   62    22   31.2  15.6   5.1
40-0847 Darl 41CV1287 05/48   48     5   33.6  15.8   5.1
35-2959 Darl 41CV0235 21/63   63    21   41.8  16.8   4.1

> tail(DartPoints)
              Name     TARL  QUAD East North Length Width Thick
38-0098 Pedernales 41BL0416 39/45   45    39   74.0  34.0   6.6
35-2951 Pedernales 41CV0235 21/63   63    21   64.5  28.5   8.2
35-0173 Pedernales 41CV0869 22/66   66    22   78.3  28.1   8.5
36-4266 Pedernales 41CV0240 15/65   65    15   64.1  27.2  12.0
41-0239 Pedernales 41CV0493 16/62   62    16   67.2  27.1  12.0
35-2855 Pedernales 41CV0843 24/65   65    24   49.3  19.5   7.5

> str(DartPoints)
'data.frame':   55 obs. of  8 variables:
 $ Name  : Factor w/ 2 levels "Darl","Pedernales": 1 1 1 1 1 1 1 1 1 1 ...
 $ TARL  : Factor w/ 43 levels "41BL0183","41BL0205",..: 13 34 18 21 ...
 $ QUAD  : Factor w/ 38 levels "05/48","08/63",..: 30 9 17 28 1 26 15 ...
 $ East  : num  62 58 62 62 48 63 33 63 59 63 ...
 $ North : num  24 12 16 22 5 21 16 17 26 20 ...
 $ Length: num  34.5 36 32.4 31.2 33.6 41.8 33.5 32 42.8 37.5 ...
 $ Width : num  15.9 17.1 14.5 15.6 15.8 16.8 16.6 16 15.8 16.3 ...
 $ Thick : num  4.8 4 5.2 5.1 5.1 4.1 4.9 5.4 5.8 6.1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:2] 39 53
  .. ..- attr(*, "names")= chr [1:2] "35-2650" "35-2384"

> attributes(DartPoints)
$names
[1] "Name"   "TARL"   "QUAD"   "East"   "North"  "Length" "Width"  "Thick"

$row.names
 [1] "35-3026"  "36-3321"  "36-3520"  "35-2382"  "40-0847"  "35-2959"
 [7] "41-0257"  "36-3619"  "41-0322"  "35-2921"  "36-3036"  "35-2905"
[13] "35-2866"  "36-3487"  "36-4247"  "35-2928"  "35-2871"  "36-3898"
[19] "35-2946"  "38-0736"  "35-2325"  "35-0164"  "41-0323"  "35-3043"
[25] "35-2004"  "35-2960"  "41-0237"  "44-0643"  "43-0110"  "36-3549"
[31] "41-0008"  "36-4320"  "44-1315M" "35-2901"  "41-0220"  "35-2873"
[37] "47-0041"  "36-3879"  "41-0054"  "50-0092"  "44-1492M" "36-3880"
[43] "35-2875"  "36-3081"  "36-3897"  "44-1253M" "36-3229"  "41-0058"
[49] "35-2391"  "38-0098"  "35-2951"  "35-0173"  "36-4266"  "41-0239"
[55] "35-2855"

$class
[1] "data.frame"

> DartPoints[1,]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints["35-3026",]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints[,6]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[,"Length"]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints$Length
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[1,6]
[1] 34.5

> head(CAIndians)
  Region   Tribe   Language AreaHouse FamilySize FpHouse PpHouse AreapPer
1      1   Yurok   Algonkin       439        7.5       1     7.5     58.5
2      2   Wiyot   Algonkin       254        7.5       1     7.5     33.8
3      3   Karok      Hokan        NA        7.5       1     7.5       NA
4      4    Hupa Athabaskan       400        7.0       1     7.0     57.1
5      5 Chilula Athabaskan        NA        7.5       1     7.5       NA
6      6  Shasta      Hokan       264        7.0       1     7.0     33.0
  HpVillage PpVillage AreapVil AreaVillage VpHouse VpPer PctFloor
1       7.8        60     3434       25450    3263   424     13.5
2       7.6        57     1930       28400    3738   498      6.8
3       4.1        31       NA          NA      NA    NA       NA
4      10.9        76     4360          NA      NA    NA       NA
5       7.0        52       NA          NA      NA    NA       NA
6       6.0        48     1584       18950    3158   394      8.4

> str(CAIndians)
'data.frame':   30 obs. of  15 variables:
 $ Region     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Tribe      : Factor w/ 30 levels "Achomawi","Athabascans",..: 30 27 8 7 5 ...
 $ Language   : Factor w/ 6 levels "Algonkin","Athabaskan",..: 1 1 3 2 2 3 3 ...
 $ AreaHouse  : int  439 254 NA 400 NA 264 110 118 100 125 ...
 $ FamilySize : num  7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ FpHouse    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ PpHouse    : num  7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ AreapPer   : num  58.5 33.8 NA 57.1 NA 33 18.3 19.6 16.7 20.8 ...
 $ HpVillage  : num  7.8 7.6 4.1 10.9 7 6 5.3 5.4 3.6 5 ...
 $ PpVillage  : int  60 57 31 76 52 48 32 32 22 30 ...
 $ AreapVil   : int  3434 1930 NA 4360 NA 1584 583 637 360 625 ...
 $ AreaVillage: int  25450 28400 NA NA NA 18950 14000 27100 61500 6390 ...
 $ VpHouse    : int  3263 3738 NA NA NA 3158 2641 5019 17084 1278 ...
 $ VpPer      : int  424 498 NA NA NA 394 438 847 2795 214 ...
 $ PctFloor   : num  13.5 6.8 NA NA NA 8.4 4.2 2.4 0.6 9.8 ...

Central Tendency
• Mean (Average) = Sum/Number
– Dichotomous (0/1) data: the mean is the proportion present (× 100 for the percentage)
• Median = Middle value
• Mode = Predominant value
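
A small sketch of the three measures on made-up values. Base R has no function that returns the statistical mode (mode() reports the storage type), so the tabulation below is one common workaround rather than a standard function. All object names and data are hypothetical.

```r
# Hypothetical presence/absence data: 1 = trait present, 0 = absent.
present <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)
pct_present <- mean(present) * 100   # mean of 0/1 data, as a percentage
pct_present

# Hypothetical categorical data: the mode is the most frequent value.
type   <- c("Darl", "Pedernales", "Darl", "Darl", "Pedernales")
counts <- table(type)
modal  <- names(counts)[counts == max(counts)]  # returns all ties, if any
modal

# The mean is pulled toward an extreme value; the median is not.
x <- c(2, 3, 4, 5, 100)
mean(x)
median(x)
```

When the mean and median differ this much, the distribution is strongly skewed and the median is usually the better description of a typical value.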

> mean(DartPoints$Length)
[1] 48.64
> median(DartPoints$Length)
[1] 47.1
> mean(CAIndians$AreaHouse)
[1] NA
> mean(CAIndians$AreaHouse, na.rm=TRUE)
[1] 299.4815
> median(CAIndians$AreaHouse)
[1] NA
> median(CAIndians$AreaHouse, na.rm=TRUE)
[1] 129
> # colMeans() gives one mean per column; current versions of R no
> # longer accept a data frame in mean()
> colMeans(DartPoints[,6:8])
   Length     Width     Thick 
48.640000 22.052727  7.283636 
> colMeans(DartPoints[DartPoints$Name=="Darl",6:8])
   Length     Width     Thick 
40.574074 18.003704  5.981481 
> colMeans(DartPoints[DartPoints$Name=="Pedernales",6:8])
   Length     Width     Thick 
56.417857 25.957143  8.539286 

Dispersion
• Range (max – min)
• Standard Deviation, Variance (Sample vs. Population)
• Coefficient of Variation = StDev/Mean * 100
• Quartiles and the Interquartile Range
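
The sample vs. population distinction can be made concrete with a short sketch. R's var() and sd() use the sample formulas (dividing by n - 1); the population versions below are computed by hand. The values are the first six Length measurements shown by head(DartPoints), re-entered by hand, and the variable names are assumptions of this example.

```r
# First six Length values from the head(DartPoints) listing.
x <- c(34.5, 36.0, 32.4, 31.2, 33.6, 41.8)
n <- length(x)

# Sample variance: sum of squared deviations divided by n - 1.
samp_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(samp_var, var(x))        # var() uses the sample formula

# Population variance and standard deviation: divide by n instead.
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

# Coefficient of variation, in percent.
cv <- sd(x) / mean(x) * 100

# Range as a single number, and the interquartile range two ways.
rng <- diff(range(x))
iqr <- unname(quantile(x, 0.75) - quantile(x, 0.25))
all.equal(iqr, IQR(x))

c(sample.sd = sd(x), population.sd = pop_sd, cv = cv, range = rng, iqr = iqr)
```

The population value is always the smaller of the two, and the gap shrinks as n grows.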

> range(DartPoints$Length)
[1] 31.2 84.0
> diff(range(DartPoints$Length))
[1] 52.8
> sd(DartPoints$Length)
[1] 12.22144
> var(DartPoints$Length)
[1] 149.3636
> sd(DartPoints$Length)/mean(DartPoints$Length)*100
[1] 25.12631
> quantile(DartPoints$Length)
   0%   25%   50%   75%  100% 
31.20 40.90 47.10 55.65 84.00 
> IQR(DartPoints$Length)
[1] 14.75
> diff(range(CAIndians$AreaHouse, na.rm=TRUE))
[1] 1175
> sd(CAIndians$AreaHouse, na.rm=TRUE)
[1] 339.4273
> var(CAIndians$AreaHouse, na.rm=TRUE)
[1] 115210.9
> quantile(CAIndians$AreaHouse, na.rm=TRUE)
    0%    25%    50%    75%   100% 
  75.0  110.5  129.0  310.0 1250.0 

Shape
• Symmetry, Skewness
– Normal = 0, Positive or Negative indicates tail in that direction
• Peaked vs Flat, Kurtosis
– Normal = 0, Positive – more clustered (peaked) than normal, Negative – more spread (flatter) than normal
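
To show where these numbers come from, here is a minimal sketch that computes skewness and excess kurtosis from the central moments. The function name moment_shape and the test vectors are hypothetical. This is the plain moment-based definition; packages apply different small-sample adjustments, so their results may differ somewhat from these.

```r
# Moment-based skewness and excess kurtosis (no small-sample adjustment).
moment_shape <- function(x) {
  x  <- x[!is.na(x)]          # drop missing values
  d  <- x - mean(x)           # deviations from the mean
  m2 <- mean(d^2)             # second central moment
  m3 <- mean(d^3)             # third central moment
  m4 <- mean(d^4)             # fourth central moment
  c(skewness = m3 / m2^1.5,
    kurtosis = m4 / m2^2 - 3) # subtract 3 so a normal distribution gives 0
}

sym   <- c(1, 2, 3, 4, 5)     # symmetric and flat
right <- c(1, 2, 3, 4, 20)    # one large value: tail to the right

moment_shape(sym)
moment_shape(right)
```

The symmetric vector gives a skewness of 0 and a negative kurtosis (flatter than normal), while the second vector gives a positive skewness because its tail points toward larger values.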

> library(e1071)
Loading required package: class
> skewness(DartPoints$Length)
[1] 0.7749526
> kurtosis(DartPoints$Length)
[1] 0.12126
> skewness(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.708035
> kurtosis(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.498035

Descriptive Stats
• summary() – in base R
• numSummary() – in Rcmdr
• describe() – in psych
• describe() – in prettyR
• stat.desc() – in pastecs

> summary(DartPoints)
         Name          TARL         QUAD         East           North      
 Darl      :27   41CV0235: 4   21/63  : 4   Min.   :33.00   Min.   : 5.00  
 Pedernales:28   41CV0859: 3   14/62  : 3   1st Qu.:55.00   1st Qu.:14.50  
                 41CV1092: 3   16/62  : 3   Median :62.00   Median :20.00  
                 41BL0205: 2   20/63  : 3   Mean   :58.24   Mean   :19.02  
                 41CV0132: 2   22/66  : 3   3rd Qu.:63.50   3rd Qu.:23.00  
                 41CV0493: 2   24/66  : 3   Max.   :70.00   Max.   :39.00  
                 (Other) :39   (Other):36                                  
     Length          Width           Thick       
 Min.   :31.20   Min.   :14.50   Min.   : 4.000  
 1st Qu.:40.90   1st Qu.:16.95   1st Qu.: 5.850  
 Median :47.10   Median :22.00   Median : 7.200  
 Mean   :48.64   Mean   :22.05   Mean   : 7.284  
 3rd Qu.:55.65   3rd Qu.:26.95   3rd Qu.: 8.050  
 Max.   :84.00   Max.   :34.00   Max.   :12.000  

> numSummary(DartPoints[,6:8])
            mean        sd   0%   25%  50%   75% 100%  n
Length 48.640000 12.221438 31.2 40.90 47.1 55.65   84 55
Width  22.052727  5.194579 14.5 16.95 22.0 26.95   34 55
Thick   7.283636  1.891870  4.0  5.85  7.2  8.05   12 55

> library(psych)
> describe(DartPoints[,6:8])
       var  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
Length   1 55 48.64 12.22   47.1   47.63 12.16 31.2  84  52.8 0.77     0.38 1.65
Width    2 55 22.05  5.19   22.0   21.85  7.41 14.5  34  19.5 0.24    -1.16 0.70
Thick    3 55  7.28  1.89    7.2    7.13  1.48  4.0  12   8.0 0.69     0.50 0.26

> detach(package:psych)
> library(prettyR)
> describe(DartPoints[,6:8])
Description of DartPoints[, 6:8]

 Numeric
        mean median   var    sd valid.n
Length 48.64   47.1 149.4 12.22      55
Width  22.05     22 26.98 5.195      55
Thick  7.284    7.2 3.579 1.892      55

> library(pastecs)
> stat.desc(DartPoints[,6:8])
                   Length        Width       Thick
nbr.val        55.0000000   55.0000000  55.0000000
nbr.null        0.0000000    0.0000000   0.0000000
nbr.na          0.0000000    0.0000000   0.0000000
min            31.2000000   14.5000000   4.0000000
max            84.0000000   34.0000000  12.0000000
range          52.8000000   19.5000000   8.0000000
sum          2675.2000000 1212.9000000 400.6000000
median         47.1000000   22.0000000   7.2000000
mean           48.6400000   22.0527273   7.2836364
SE.mean         1.6479384    0.7004369   0.2550997
CI.mean.0.95    3.3039176    1.4042914   0.5114441
var           149.3635556   26.9836498   3.5791717
std.dev        12.2214384    5.1945789   1.8918699
coef.var        0.2512631    0.2355527   0.2597425

Decimals
• The various summary functions provide limited ways to round or modify the output:
– Options digits= and scipen= can be set before running the summary
– Wrapping the function in round() works for some
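
A minimal sketch of both approaches on a plain named vector. The three values are copied from the Length column of the stat.desc() output in this deck; the object names are assumptions. round() and signif() change the stored values, while options() only changes how numbers are printed.

```r
# Values copied from the stat.desc() output for Length.
x <- c(skewness = 0.7749526, kurtosis = 0.12126, normtest.p = 0.01207084)

# round() fixes the number of decimal places; signif() the significant digits.
r2 <- round(x, 2)
s3 <- signif(x, 3)
r2
s3

# options() changes printing only; save the old settings so they can be restored.
digits_before <- getOption("digits")
op <- options(digits = 3, scipen = 100)  # fewer digits, avoid scientific notation
print(x)
options(op)                              # restore the previous settings

# format() controls notation for a single value without changing global options.
format(1.207084e-02, scientific = FALSE)
```

Saving the result of options() in op and passing it back afterwards keeps the change from affecting later output in the session.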

> stat.desc(DartPoints[,6:8], norm=TRUE)
                 Length       Width      Thick
. . . . .
skewness   7.749526e-01  0.23775656 0.68608894
skew.2SE   1.204307e+00  0.36948312 1.06620943
kurtosis   1.212600e-01 -1.22700641 0.23109312
kurt.2SE   9.570531e-02 -0.96842355 0.18239189
normtest.W 9.435694e-01  0.92900685 0.94970998
normtest.p 1.207084e-02  0.00300526 0.02233218
> op <- options(digits=3, scipen=100)
> stat.desc(DartPoints[,6:8], norm=TRUE)
           Length    Width  Thick
. . . . .
skewness   0.7750  0.23776 0.6861
skew.2SE   1.2043  0.36948 1.0662
kurtosis   0.1213 -1.22701 0.2311
kurt.2SE   0.0957 -0.96842 0.1824
normtest.W 0.9436  0.92901 0.9497
normtest.p 0.0121  0.00301 0.0223
> options(op)

Publishable Tables
• Much of the focus in producing publishable tables in R is on LaTeX
• Most anthropologists are more familiar with HTML
• xtable() provides both if there is an xtable method for your output

Using xtable()
• xtable(function-output) produces a LaTeX version of the table
• print(xtable(function-output), type="html") converts to HTML
• Adding file="mytable.html" to the print() call will write the output to a file

> library(xtable)
> library(psych)
> print(xtable(describe(DartPoints[,6:8])), type="html")
<TABLE border=1>
<TR> <TH>  </TH> <TH> var </TH> <TH> n </TH> <TH> mean </TH> <TH> sd </TH> <TH> median </TH> <TH> trimmed </TH> <TH> mad </TH> <TH> min </TH> <TH> max </TH> <TH> range </TH> <TH> skew </TH> <TH> kurtosis </TH> <TH> se </TH> </TR>
. . . . .
</TABLE>

Use print(xtable(x), type="html"), select the HTML output (<TABLE> to </TABLE>), copy, and paste it into Excel; or use print(xtable(x), type="html", file="filename.html") and insert the file into Excel or Word.

Summary

Function              Mixed Data  Customize  Groups  digits=     round()  xtable()
summary               Yes         No         by()    Yes         No       Yes
numSummary - Rcmdr    No          Yes        Yes     print() (1) Yes (1)  Yes (1)
describe - psych      No          Yes        Yes     print()     No       Yes
describe - prettyR    Yes         Yes        by()    print()     Yes      No
stat.desc - pastecs   No          Some       by()    No          No       Yes

(1) Extract the table part of the results:
print(numSummary(x)$table, digits=4)
round(numSummary(x)$table, 3)
print(xtable(numSummary(x)$table), type="html")
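
For the functions that rely on by() for grouped summaries, the pattern looks roughly like the sketch below. It uses six rows typed in by hand from the head() and tail() listings of DartPoints in this deck; the object names pts, grp_means and len_means are assumptions of this example.

```r
# Three Darl and three Pedernales rows re-entered from the listings in this deck.
pts <- data.frame(
  Name   = factor(rep(c("Darl", "Pedernales"), each = 3)),
  Length = c(34.5, 36.0, 32.4, 74.0, 64.5, 78.3),
  Width  = c(15.9, 17.1, 14.5, 34.0, 28.5, 28.1)
)

# by() splits the data frame on a factor and applies a function to each piece.
by(pts[, c("Length", "Width")], pts$Name, summary)

# aggregate() returns the group statistics as a data frame, one row per group.
grp_means <- aggregate(cbind(Length, Width) ~ Name, data = pts, FUN = mean)
grp_means

# tapply() does the same for a single variable and returns a named vector.
len_means <- tapply(pts$Length, pts$Name, mean)
round(len_means, 1)
```

by() is convenient for viewing results, while aggregate() is often easier to pass on to other functions because its result is an ordinary data frame.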
