38
CA200 (based on the book by Prof. Jane M. Horgan) 1 3. Basics of R cont. Summarising Statistical Data Graphical Displays

Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

CA200

(based on the book by Prof. Jane M. Horgan)

1

3. Basics of R – cont. Summarising Statistical Data

Graphical Displays

Page 2: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Basics

– 6+7*3/2 #general expression [1] 16.5

– x <- 1:4 #integers are assigned to the vector x x #print x [1] 1 2 3 4

– x2 <- x**2 #square the element, or x2<-x^2 x2 [1] 1 4 9 16

– X <- 10 #case sensitive! prod1 <- X*x prod1 [1] 10 20 30 40

CA200 2

Page 3: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Getting Help

• click the Help button on the toolbar

• help()

• help.start()

• demo()

• ?read.table

• help.search ("data.entry")

• apropos (“boxplot”) - "boxplot", "boxplot.default", "boxplot.stat”

3 CA200

Page 4: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Statistics: Measures of Central Tendency

Typical or central points:

• Mean: Sum of all values divided by the number of cases

• Median: Middle value. 50% of data below and 50% above

• Mode: Most commonly occurring value, value with the highest frequency

4 CA200

Page 5: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Statistics: Measures of Dispersion

Spread or variation in the data

• Standard Deviation (σ): The square root of the average squared deviations from the mean

- measures how the data values differ from the mean

- a small standard deviation implies most values are near the average

- a large standard deviation indicates that values are widely spread above and below the average.

5 CA200

Page 6: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Statistics: Measures of Dispersion

Spread or variation in the data

• Range: Lowest and highest value

• Quartiles: Divides data into quarters. 2nd quartile is median

• Interquartile Range: 1st and 3rd quartiles, middle 50% of the data.

6 CA200

Page 7: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry

• Entering data from the screen to a vector

• Example: 1.1 downtime <-c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,

30,30,30,33,36,44,45,47,51)

mean(downtime)

[1] 25.04348

median(downtime)

[1] 25

range(downtime)

[1] 0 51

sd(downtime)

[1] 14.27164

7 CA200

Page 8: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry – cont.

• Entering data from a file to a data frame • Example 1.2: Examination results: results.txt

gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99 97 95 96 m 89 92 86 94 m 91 97 91 97 m 100 88 96 85 f 86 82 89 87 and so on

8 CA200

Page 9: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry – cont.

• NA indicates missing value. • No mark for arch1 and prog1 in second record. • results <- read.table ("C:\\results.txt", header = T) # download the file to desired location

• results$arch1[5] [1] 89

• Alternatively • attach(results) • names(results)

• allows you to access without prefix results.

• arch1[5] [1] 89

9 CA200

Page 10: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry – Missing values • mean(arch1)

[1] NA #no result because some marks are missing

• na.rm = T (not available, remove) or

• na.rm = TRUE

• mean(arch1, na.rm = T)

[1] 83.33333

• mean(prog1, na.rm = T)

[1] 84.25

• mean(arch2, na.rm = T)

• mean(prog2, na.rm = T)

• mean(results, na.rm = T)

gender arch1 prog1 arch2 prog2

NA 94.42857 93.00000 89.75000 90.37500

10

Page 11: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry – cont.

• Use “read.table” if data in text file are separated by spaces

• Use “read.csv” when data are separated by commas

• Use “read.csv2” when data are separated by semicolon

11 CA200

Page 12: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Data Entry – cont.

Entering a data into a spreadsheet:

• newdata <- data.frame() #brings up a new spreadsheet called newdata

• fix(newdata) #allows to subsequently add data to this data frame

12 CA200

Page 13: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Summary Statistics Example 1.1: Downtime:

summary(downtime) Min. 1st Qu. Median Mean 3rd Qu. Max.

0.00 16.00 25.00 25.04 31.50 51.00

Example 1.2: Examination Results:

summary(results)

Gender arch1 prog1 arch2 prog2

f: 4 Min. : 3.00 Min. :65.00 Min. :56.00 Min. :63.00

m:22 1st Qu.: 79.25 1st Qu.:80.75 1st Qu.:77.75 1st Qu.:77.50

Median : 89.00 Median :82.50 Median :85.50 Median :84.00

Mean : 83.33 Mean :84.25 Mean :81.15 Mean :83.85

3rd Qu.: 96.00 3rd Qu.:90.25 3rd Qu.:91.00 3rd Qu.:92.50

Max. :100.00 Max. :98.00 Max. :96.00 Max. :97.00

NA's : 2.00 NA's : 2.00

Page 14: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Summary Statistics - cont. Example 1.2: Examination Results:

For a separate analysis use:

mean(results$arch1, na.rm=T) # hint: use attach(results)

[1] 83.33333

summary(arch1, na.rm=T)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

3.00 79.25 89.00 83.33 96.00 100.00 2.00

14

Page 15: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Programming in R

• Example 1.3: Write a program to calculate the mean of downtime

Formula for the mean:

x <- sum(downtime) # sum of elements in downtime

n <- length(downtime) #number of elements in the vector

mean_downtime <- x/n

or

mean_downtime <- sum(downtime) / length(downtime)

15

Page 16: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Programming in R – cont.

• Example 1.4: Write a program to calculate the standard deviation of downtime

#hint - use sqrt function

CA200 16

Page 17: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical displays - Boxplots

• Boxplot – a graphical summary based on the median, quartile and extreme values

boxplot(downtime)

• box represents the interquartile

range which contains 50% of cases

• whiskers are lines that extend

from max and min value

• line across the box represents median

• extreme values are cases on more than

1.5box length from max/min value

CA200 17

Page 18: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical displays – Boxplots – cont.

• To improve graphical display use labels:

boxplot(downtime, xlab = "downtime", ylab = "minutes")

18

Page 19: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical displays – Multiple Boxplots

• Multiple boxplots at the same axis - by adding extra arguments to boxplot function:

boxplot(results$arch1, results$arch2,

xlab = " Architecture, Semesters 1 and 2" )

• Conclusions: – marks are lower in sem2

– Range of marks in narrower in sem2

• Note outliers in sem1! 1.5 box length

from max/min value. Atypical values.

Page 20: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical displays – Multiple Boxplots – cont.

• Displays values per gender:

boxplot(arch1~gender,

xlab = "gender", ylab = "Marks(%)",

main = "Architecture Semester 1")

• Note the effect of using:

main = "Architecture Semester 1”

Page 21: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Par

Display plots using par function

• par (mfrow = c(2,2)) #outputs are displayed in 2x2 array • boxplot (arch1~gender, main = "Architecture Semester 1") • boxplot(arch2~gender, main = "Architecture Semester 2") • boxplot(prog1~gender, main = "Programming Semester 1") • boxplot(prog2~gender, main = "Programming Semester 2") To undo matrix type:

• par(mfrow = c(1,1)) #restores graphics to the full screen 21

Page 22: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Par – cont.

Conclusions: - female students are doing less well in programming for sem1 - median for female students for prog. sem1 is lower than for male students

22

Page 23: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Histograms

• A histogram is a graphical display of frequencies in the categories of a variable

hist(arch1, breaks = 5,

xlab ="Marks(%)",

ylab = "Number of students",

main = "Architecture Semester 1“ )

• Note: A histogram with five breaks

equal width

- count observations that fill

within categories or “bins”

23

Page 24: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Histograms

hist(arch2,

xlab ="Marks(%)",

ylab = "Number of students",

main = “Architecture Semester 2“ )

• Note: A histogram with default breaks

CA200 24

Page 25: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Using par with histograms

• The par can be used to represent all the subjects in the diagram

• par (mfrow = c(2,2))

• hist(arch1, xlab = "Architecture",

main = " Semester 1", ylim = c(0, 35))

• hist(arch2, xlab = "Architecture",

main = " Semester 2", ylim = c(0, 35))

• hist(prog1, xlab = "Programming",

main = " ", ylim = c(0, 35))

• hist(prog2, xlab = "Programming",

main = " ", ylim = c(0, 35))

Note: ylim = c(0, 35) ensures that the y-axis is the same scale for all four objects!

CA200 25

Page 26: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

CA200 26

Page 27: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Stem and leaf

• Stem and leaf – more modern way of displaying data! Like histograms: diagrams gives frequencies of categories but gives the actual values in each category

• Stem usually depicts the 10s and the leaves depict units.

stem (downtime, scale = 2)

The decimal point is 1 digit(s) to the right of the | 0 | 012 1 | 2248 2 | 1134589 3 | 00036 4 | 457 5 | 1

CA200 27

Page 28: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Stem and leaf – cont.

• stem(prog1, scale = 2) The decimal point is 1 digit(s) to the right of the | 6 | 5 7 | 12 7 | 66 8 | 01112223 8 | 5788 9 | 012 9 | 7778 Note: e.g. there are many students with mark 80%-85%

CA200 28

Page 29: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Scatter Plots

• To investigate relationship between variables:

plot(prog1, prog2,

xlab = "Programming, Semester 1",

ylab = "Programming, Semester 2")

• Note:

- one variable increases with other!

- students doing well in prog1 will do well

in prog2!

CA200 29

Page 30: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Pairs

• If more than two variables are involved:

courses <- results[2:5]

pairs(courses) #scatter plots for all possible pairs

or

pairs(results[2:5])

CA200 30

Page 31: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Pairs – cont.

CA200 31

Page 32: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical display vs. Summary Statistics

• Importance of graphical display to provide insight into the data!

• Anscombe(1973), four data sets

• Each data set consist of two variables on which there are 11 observations

CA200 32

Page 33: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Graphical display vs. Summary Statistics

Data Set 1 Data Set 2 Data Set 3 Data Set 4 x1 y1 x2 y2 x3 y3 x4 y4 10 8.04 10 9.14 10 7.46 8 6.58 8 6.95 8 8.14 8 6.77 8 5.76 13 7.58 13 8.74 13 12.74 8 7.71 9 8.81 9 8.77 9 7.11 8 8.84 11 8.33 11 9.26 11 7.81 8 8.47 14 9.96 14 8.10 14 8.84 8 7.04 6 7.24 6 6.13 6 6.08 8 5.25 4 4.26 4 3.10 4 5.39 19 12.50 12 10.84 12 9.13 12 8.15 8 5.56 7 4.82 7 7.26 7 6.42 8 7.91 5 5.68 5 4.74 5 5.73 8 6.89

CA200 33

Page 34: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

First read the data into separate vectors:

• x1<-c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) • y1<-c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

• x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) • y2 <-c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)

• x3<- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) • y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42,

5.73)

• x4<- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8) • y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91,

6.89)

CA200 34

Page 35: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

For convenience, group the data into frames:

• dataset1 <- data.frame(x1,y1)

• dataset2 <- data.frame(x2,y2)

• dataset3 <- data.frame(x3,y3)

• dataset4 <- data.frame(x4,y4)

CA200 35

Page 36: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

• It is usual to obtain summary statistics:

1. Calculate the mean:

mean(dataset1) x1 y1 9.000000 7.500909

mean(data.frame(x1,x2,x3,x4)) x1 x2 x3 x4 9 9 9 9 mean(data.frame(y1,y2,y3,y4)) y1 y2 y3 y4 7.500909 7.500909 7.500000 7.500909 2. Calculate the standard deviation: sd(data.frame(x1,x2,x3,x4)) x1 x2 x3 x4 3.316625 3.316625 3.316625 3.316625 sd(data.frame(y1,y2,y3,y4)) y1 y2 y3 y4 2.031568 2.031657 2.030424 2.030579 Everything seems the same!

CA200 36

Page 37: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

• But when we plot:

• par(mfrow = c(2, 2))

• plot(x1,y1, xlim=c(0, 20), ylim =c(0, 13))

• plot(x2,y2, xlim=c(0, 20), ylim =c(0, 13))

• plot(x3,y3, xlim=c(0, 20), ylim =c(0, 13))

• plot(x4,y4, xlim=c(0, 20), ylim =c(0, 13))

CA200 37

Page 38: Summarising Statistical Data Graphical Displayshruskin/CA200_2_Basics_of_R_2.pdf · 2013. 10. 2. · gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99

Everything seems different! Graphical displays are the core of getting insight/feel for the data!

Note: 1. Data set 1 in linear with some

scatter 2. Data set 2 is quadratic 3. Data set 3 has an outlier.

Without them the data would be linear

4. Data set 4 contains x values which are equal expect one outlier. If removed, the data would be vertical.

38