Original coding by Eric Lecoutre
Initial coding: Sean Lorenz
Introduction to R
Anders Stockmarr
DTU Compute
Section for Statistics and Data Analysis
Technical University of Denmark
anst@dtu.dk
DTU Management Engineering
May 22, 2017
(DTU) R intro May 22, 2017 1 / 93
Coworkers
Elisabeth Wreford Andersen, The Danish Cancer Society
Kasper Kristensen, DTU AQUA
Andes Nielsen, DTU AQUA
Outline of Talk
Introduction to R
Data management
Graphics
Linear Models
ggplot2
Outline
1 Introduction to R
2 Importing Data to R
3 Description of Data
4 Modifying Data
5 Graphics: Histogram, Box plot, Scatter Plot, Line plot
Introduction to R
R is a programming language and a programming environment.
It is Free! Developed by users under a GNU license.
Runs on a variety of platforms including Windows, Unix and MacOS.
You can even get it for Android.
Allows for fast implementation of new methods by user demand through packages.
R has state-of-the-art graphics capabilities.
Advantages of R
Frank Harrell in 2009 (my highlighting):
"One point that hasn’t been made very explicitly is one of the greatest advantages of R:
Getting your work done better and in less time.
Hundreds of companies hire a multitude of SAS programmers to write code in an archaic language, the SAS macro language. I believe there is a real cost savings from R because of its value as a data analysis, data manipulation, and graphics environment. Instead of programming using an indirect syntax manipulation environment (SAS macros), in R you can program in a dynamic data-sensitive framework".
That was 8 years ago. Things have progressed since...
Base R
Base R and most R packages are available for download at the Comprehensive R Archive Network (CRAN).
http://www.cran.r-project.org
Base R includes basic data management, analysis and graphics tools.
For non-specialized tasks, Base R is all you need.
Specialized tasks may be handled by packages.
We will download, install and use packages.
Packages are not all very well-documented (depends on the contributor).
Want to be sure about what your program does?
Use well-established packages only; or write your own code.
RStudio
You can work directly in R.
Many prefer another front end (GUI, Graphical User Interface).
We will use RStudio.
Download from http://www.rstudio.com/
RStudio
The GUI RStudio has 4 windows.
One for writing the commands (the "script"). Use script for reproducibility.
One for results and interactive use.
One for plots, help and packages.
One showing which objects are resident in the R memory.
R as a calculator
2+2
[1] 4
(2*5)+(12/3)-(2^3)
[1] 6
exp(log(1))
[1] 1
sqrt(25)
[1] 5
log(2*2)
[1] 1.3863
log(2)+log(2)
[1] 1.3863
Writing commands in R
Commands are separated by either a new line or ;
R is case sensitive: id is a different name than ID.
The character # at the beginning of a line shows that the text in this line is a comment, i.e. the text is not executed.
Help can be found on the internet; from colleagues; or in R by writing ? followed by the function you want help with:
?plot
or, in RStudio, highlight the expression and press F1.
Objects in R
Both data and output from analyses are stored as objects (if stored).
Sometimes, output is just displayed on the screen, and you need to assign the object to an identifier to keep it.
In fact, everything in the R memory is stored in objects.
An object could be a vector, a matrix or a data frame.
Values are assigned to objects using the assignment operator <-.
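A minimal sketch of assignment with `<-`; the identifiers x, y and z and their values are made up for illustration:

```r
# Assign values to identifiers with the assignment operator <-
x <- 5             # a single number
y <- c(1, 2, 3)    # a numeric vector
z <- x * y         # the result of a calculation is kept in z

# Typing the name of an object prints its value
z

# ls() lists the objects currently stored in the R memory
ls()
```

Without the assignment, `x * y` would only be displayed on the screen and not kept.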
Generating a sequence
Specify the first and last values separated by a colon.
Otherwise use seq().
0:10
[1] 0 1 2 3 4 5 6 7 8 9 10
15:5
[1] 15 14 13 12 11 10 9 8 7 6 5
seq(from = 0, to = 1.2, by = 0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2
x
Generating repeats using rep()
rep(8, 5)
[1] 8 8 8 8 8
rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4
rep(1:4, each = 2, times = 3)
[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
Functions in R
We assign a simple function to the identifier f:
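A minimal sketch of defining and calling a function; the body of f and the argument name power are hypothetical choices made for illustration:

```r
# A hypothetical function f with one required argument and one default value
f <- function(x, power = 2) {
  x^power
}

f(3)             # uses the default, power = 2
f(3, power = 3)  # overrides the default
f                # typing the name without parentheses prints the definition
```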
>fff
Functions in R
We have already used many functions with and without default values:
"+"(2,2)
sqrt(25)
log(2)
ls()
":"(0,10)
seq(from=0.1,to=1.2,by=0.1)
rep(1:4,each=2,time=3)
Many applications in R are built up as functions. You can see default arguments in the help files. Example: log.
Data structures in R: Singles
Logical, e.g.:
> TRUE
[1] TRUE
> 1==2
[1] FALSE
Single numbers, e.g.:
> 1
[1] 1
> 1.2
[1] 1.2
Character, e.g.:
> "5"
[1] "5"
> "abc"
[1] "abc"
Data structures in R: Vectors
Constructed via the concatenate function c().
Vector of numbers, e.g.:
> c(1,1.2,pi,exp(1))
[1] 1.000000 1.200000 3.141593 2.718282
We can have vectors of other things too, e.g.:
> c(TRUE,1==2)
[1]  TRUE FALSE
> c("a","ab","abc")
[1] "a"   "ab"  "abc"
But not combinations, e.g.:
> c("a",5,1==2)
[1] "a"     "5"     "FALSE"
Note that R just turned everything into characters!
Data structures in R: Matrices
Columns of same type and same length:
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)
         [,1]     [,2]     [,3]
[1,] 4.141593 6.141593 8.141593
[2,] 5.141593 7.141593 9.141593
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)
Data structures in R: Data frames
Same length of columns but different types; spreadsheet data.
Created from reading in data from external files;
or by using the function data.frame() on a set of vectors.
> data.frame(treatment=c("active","active","placebo"),
+ bp=c(80,85,90))
  treatment bp
1    active 80
2    active 85
3   placebo 90
Compare to a matrix created with the cbind() command:
> cbind(treatment=c("active","active","placebo"),bp=c(80,85,90))
     treatment bp
[1,] "active"  "80"
[2,] "active"  "85"
[3,] "placebo" "90"
Data structures in R: Lists
Different length of columns and different types.
Most general object type.
> list(a=1,b="abc",c=c(1,2,3),d=list(e=matrix(1:4,2), f=function(x){x^2}))
$a
[1] 1
$b
[1] "abc"
$c
[1] 1 2 3
$d
$d$e
     [,1] [,2]
[1,]    1    3
[2,]    2    4
$d$f
function (x){
x^2}
The objects returned from many of the built-in functions in R are fairly complicated lists.
Importing Data to R
Data can be imported directly from SAS, SPSS, Excel, STATA etc.
The easiest is to use data saved as text files.
Usually values in text files are separated, or delimited, by tabs or commas.
First tell R where you want to find your data using the command setwd().
Check that all went to plan with getwd().
setwd("C:/users/anst/Foredrag/DTU Management Engineering 22052017")
getwd()
[1] "C:/users/anst/Foredrag/DTU Management Engineering 22052017"
Importing Data to R
The function read.table() can be used to read data saved as text.
Wrappers: read.csv(), read.csv2() and read.delim().
Notice the option sep, which sets the field separator.
We are assigning the loaded data to objects.
If you have an Excel sheet, then save as text.
Births.tab
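A minimal sketch of reading a tab-separated text file with read.table(). The file is written to a temporary location first so that the example is self-contained; the object name demo.tab and the three data rows are made up and are not the Births data:

```r
# Write a small tab-separated file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tbweight\tsexalph",
             "1\t2974\tfemale",
             "2\t3270\tmale",
             "3\t\tfemale"),      # the empty bweight field is read as NA
           tmp)

# header = TRUE: the first line holds the variable names
# sep = "\t":    values are separated by tabs
demo.tab <- read.table(tmp, header = TRUE, sep = "\t")
demo.tab
```

read.delim(tmp) would read the same file, since it is a wrapper with header = TRUE and sep = "\t" as defaults.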
Importing Data using RStudio
In the Objects Window, click "Import Dataset"
Importing Data From Other Programs
We can read data from a series of other statistical software packages using the package foreign.
# INSTALL AN EXTRA PACKAGE
install.packages("foreign")
# ACTIVATE THE PACKAGE
library("foreign")
SPSS_Data
Looking At Your Data
There are several ways to look at the data (or parts of the data).
# FIRST FEW OBSERVATIONS
head(Births.tab)
  id bweight lowbw gestwks preterm matage hyp sex sexalph
1  1    2974     0   38.52       0     34   0   2  female
2  2    3270     0      NA      NA     30   0   1    male
3  3    2620     0   38.15       0     35   0   2  female
4  4    3751     0   39.80       0     31   0   1    male
5  5    3200     0   38.89       0     33   1   1    male
6  6    3673     0   40.97       0     33   0   2  female
Looking At Your Data
# LAST FEW OBSERVATIONS
tail(Births.tab)
     id bweight lowbw gestwks preterm matage hyp sex sexalph
495 495    2968     0   41.01       0     34   0   1    male
496 496    2852     0   38.45       0     28   0   2  female
497 497    3187     0   38.03       0     38   1   1    male
498 498    3054     0   38.50       0     26   0   2  female
499 499    3178     0   39.92       0     31   0   2  female
500 500    2918     0   37.97       0     31   0   1    male
# VARIABLE NAMES
names(Births.tab)
[1] "id"      "bweight" "lowbw"   "gestwks" "preterm" "matage"  "hyp"
[8] "sex"     "sexalph"
# VIEW THE DATA IN A NEW WINDOW
View(Births.tab)
Missing values
In R, missing values are coded as NA (not available).
In your Excel file leave missing values blank, do not set them to 99 or 999.
  id bweight lowbw gestwks preterm matage hyp sex sexalph
1  1    2974     0   38.52       0     34   0   2  female
2  2    3270     0      NA      NA     30   0   1    male
Accessing Observations
Data are (usually) stored in a data frame object.
Observations are the rows.
Variables, either numerical or categorical, are the columns.
We can access individual rows, columns and cells in the data frame.
For this, we use the bracket operator: object[row, column].
Accessing Observations
# A SINGLE CELL
Births.tab[345, 4]
[1] 38.55
# LEAVING OUT A COLUMN NUMBER INDICATES THAT ALL COLUMNS
# ARE CHOSEN. HERE ALL COLUMNS IN ROW 224
Births.tab[224 , ]
     id bweight lowbw gestwks preterm matage hyp sex sexalph
224 224    3216     0   39.94       0     38   1   1    male
Accessing Observations
# LEAVING OUT A ROW NUMBER INDICATES THAT ALL ROWS ARE CHOSEN
# HERE ALL ROWS IN COLUMN 5
Births.tab[ ,5]
[1] 0 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
[24] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
[47] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
[70] 0 0 0 1 1 0 0 0 0 0 NA 0 0 0 0 0 0 0 0 0 0 NA 0
[93] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 0 0 0 0 0 1
[116] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
[139] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[162] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
[185] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
[208] 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0
[231] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
[254] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1
[277] 0 0 0 0 1 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[300] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0
[323] 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
[346] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 NA 0 1 0 1 0 0
[369] 0 1 0 0 0 0 0 0 1 0 0 0 NA 0 0 0 0 0 0 0 0 0 0
[392] 0 0 0 1 NA 0 0 NA NA 0 0 0 0 0 0 0 0 0 0 1 0 0 1
[415] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
[438] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[461] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
[484] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Accessing Observations
# USE RANGES, ROWS 15 TO 18 COLUMNS 1 TO 4
Births.tab[15:18, 1:4]
   id bweight lowbw gestwks
15 15    3662     0   39.23
16 16    3035     0   38.96
17 17    3351     0   39.35
18 18    3804     0   38.99
Accessing Observations
Variables can be accessed directly using the $ operator (object$variable), the name (object[ ,"variable"]), or the column number (object[ ,k]).
# GET THE BIRTH WEIGHT FOR CHILD 26 TO 36
Births.tab$bweight[26:36]
[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509
Births.tab[26:36, "bweight"]
[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509
Births.tab[26:36,2]
[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509
Subsetting using the c() function
The concatenate function c() can be used to access non-sequential rows and columns from a data frame.
# GET COLUMNS 2, 5, 7, 8, 9 FOR ROW 33
Births.tab[33, c(2, 5, 7:9)]
   bweight preterm hyp sex sexalph
33    2887       0   0   1    male
# GET bweight, preterm and sexalph FOR ROW 71
Births.tab[71, c("bweight", "preterm", "sexalph")]
   bweight preterm sexalph
71    3189       0    male
Variable Names
If we want to change the variable names we can use names().
# NEW VARIABLE NAMES
names(Births.tab)
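A minimal sketch of renaming variables by assigning to names(); the data frame df and the new names are made up for illustration:

```r
# A small made-up data frame
df <- data.frame(a = 1:3, b = c(2.5, 3.1, 4.0))
names(df)                        # current variable names

# Replace all names at once ...
names(df) <- c("id", "weight")

# ... or a single name by position
names(df)[2] <- "bweight"
names(df)
```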
Saving/Exporting data
We can save the data to a text file, using either write.table() for a tab-separated file, or write.csv()/write.csv2() for a comma/semicolon-separated file (with "." and "," as decimal mark, respectively).
write.table(Births.tab, file = "Birth_new.txt", sep = "\t", na = ".", row.names = FALSE)
write.csv2(Births.tab, file = "Birth_new.csv")
Description of Data
We are still looking at the data set with birth weights for 500 children.
Using the function str() we can see a description of what our data frame contains (the structure).
str(Births.tab)
'data.frame': 500 obs. of 9 variables:
 $ id     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ bweight: int 2974 3270 2620 3751 3200 3673 3628 3773 3960 3405 ...
 $ lowbw  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ gestwks: num 38.5 NA 38.2 39.8 38.9 ...
 $ preterm: int 0 NA 0 0 0 0 0 0 0 0 ...
 $ matage : int 34 30 35 31 33 33 29 37 36 39 ...
 $ hyp    : int 0 0 0 0 1 0 0 0 0 0 ...
 $ sex    : int 2 1 2 1 1 2 2 1 2 1 ...
 $ sexalph: Factor w/ 2 levels "female","male": 1 2 1 2 2 1 1 2 1 2 ...
Description of Data: Birth weights
The Births.tab dataset is a data frame with 500 observations and 9 variables.
Some are integers but "gestwks" is numeric.
The variable "sexalph" is a factor. This is a categorical variable (either numeric or string) with a finite number of levels, here "female" and "male".
"sexalph" and "sex" contain the same information, but "sexalph" is a factor while "sex" is not.
We can convert "sex" to a factor using as.factor().
Description of Data: Birth weights
# TELL R THAT sex IS A FACTOR
Births.tab$sex
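A minimal sketch of converting a numeric code to a factor. The vector sex below is made up; the coding 1 = male, 2 = female follows the head() output of the Births data shown earlier:

```r
# Made-up numeric codes: 1 = male, 2 = female
sex <- c(2, 1, 2, 1, 1)

# as.factor() turns the codes into a factor with levels "1" and "2"
sex_f <- as.factor(sex)
levels(sex_f)

# factor() can attach readable labels at the same time
sexalph <- factor(sex, levels = c(1, 2), labels = c("male", "female"))
table(sex_f, sexalph)
```

For a data frame the result is assigned back to the column, in the same way as any other variable.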
Descriptive Statistics
There are many simple extractor functions for summary statistics in R.
Common functions are mean(), sd(), median(), max() and min().
mean(Births.tab$bweight)
[1] 3136.9
sd(Births.tab$bweight)
[1] 637.45
median(Births.tab$bweight)
[1] 3188.5
max(Births.tab$bweight)
[1] 4553
min(Births.tab[ , 2])
[1] 628
The Summary Function
The function summary() can be used with many objects in R.
When used on a data frame we get all the main summary statistics.
# SUMMARY OF THE DATA FRAME
summary(Births.tab)
       id         bweight         lowbw         gestwks
 Min.   :  1   Min.   : 628   Min.   :0.00   Min.   :24.7
 1st Qu.:126   1st Qu.:2862   1st Qu.:0.00   1st Qu.:37.9
 Median :250   Median :3188   Median :0.00   Median :39.1
 Mean   :250   Mean   :3137   Mean   :0.12   Mean   :38.7
 3rd Qu.:375   3rd Qu.:3551   3rd Qu.:0.00   3rd Qu.:40.1
 Max.   :500   Max.   :4553   Max.   :1.00   Max.   :43.2
                                             NA's   :10
    preterm          matage        hyp         sex       sexalph
 Min.   :0.000   Min.   :23   Min.   :0.000   1:264   female:236
 1st Qu.:0.000   1st Qu.:31   1st Qu.:0.000   2:236   male  :264
 Median :0.000   Median :34   Median :0.000
 Mean   :0.129   Mean   :34   Mean   :0.144
 3rd Qu.:0.000   3rd Qu.:37   3rd Qu.:0.000
 Max.   :1.000   Max.   :43   Max.   :1.000
 NA's   :10
Summaries
We may only want summaries for some of the data, e.g. babies with birth weight below 2900 g.
We subset the data and then summarize as before:
summary(Births.tab[Births.tab$bweight
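A minimal sketch of subsetting with a logical condition before summarizing; the data frame df and its values are made up and only mimic the layout of the Births data:

```r
# A small made-up data frame
df <- data.frame(id = 1:6,
                 bweight = c(2974, 3270, 2620, 3751, 1780, 2887))

# The condition gives TRUE/FALSE per row; the empty column index keeps all columns
light <- df[df$bweight < 2900, ]
light

# Summaries are then computed on the subset only
summary(light$bweight)
```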
Group Summaries
We can work on data separated by groups.
Suppose that we want to calculate the mean birth weight for boys and girls (many ways to do this).
We will use the tapply() function to apply the mean function to the two levels of "sexalph".
tapply(X, INDEX, FUN): X is the variable, INDEX the grouping factor and FUN the function to apply.
# MEAN BIRTH WEIGHT FOR BOYS AND GIRLS
tapply(Births.tab$bweight, Births.tab$sexalph, mean)
  female     male
3032.831 3229.902
Histogram
Often it is easier to get an impression of a distribution using plots.
Histograms are typically used for continuous variables.
hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")
[Figure: histogram with heading "Title"; x-axis "Birth weight (g)", y-axis "Frequency"]
Histogram
Often it is easier to get an impression of a distribution using plots.
Histograms are typically used for continuous variables. Here with a box on.
hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")
box()
[Figure: the same histogram with a box drawn around the plot region; x-axis "Birth weight (g)", y-axis "Frequency"]
Boxplot
Boxplots show the median, the upper and lower quartiles, and potentially extreme values.
boxplot(Births.tab$bweight, xlab = "Birth weight (g)")
[Figure: box plot of birth weight; axis label "Birth weight (g)"]
Modifying Data
We will concentrate on how to modify and rearrange our data.
Data can be sorted with the order function.
order can sort the Births.tab data by "sex", and then by "bweight".
The order function returns a vector of sorted indices, which we apply to the rows of the unsorted data frame to get a sorted version.
Birth_sort
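A minimal sketch of sorting with order(), assuming a small made-up data frame (df, idx, df_sort and the values are illustrative, not the Births data):

```r
# A small made-up data frame
df <- data.frame(sex = c(2, 1, 2, 1),
                 bweight = c(3200, 2900, 2600, 3500))

# order() returns the row indices that sort by sex, then by bweight within sex
idx <- order(df$sex, df$bweight)
idx

# Applying the indices to the rows gives the sorted data frame
df_sort <- df[idx, ]
df_sort
```

Wrapping a numeric variable in a minus sign, e.g. `order(df$sex, -df$bweight)`, sorts that variable in decreasing order.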
Creating new variables and deleting old
New variables can be added to a data frame.
# ADD A VARIABLE TO DATA FRAME
Births.tab$log_bweight
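A minimal sketch of adding and deleting variables, assuming that the new variable is the natural logarithm of the birth weight (the name log_bweight suggests this, but the transformation is an assumption); df is a made-up data frame:

```r
# A small made-up data frame
df <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620))

# Add a variable: assign to a column name that does not exist yet
df$log_bweight <- log(df$bweight)

# Delete a variable: assign NULL to it
df$id <- NULL

names(df)
```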
Grouping the values of a variable using cut
Use cut() if you want to group a continuous variable, e.g. mother’s age (matage), into the groups ]20-30], ]30-35], ]35-40], ]40-45].
Births.tab$agegrp
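A minimal sketch of grouping with cut(); the ages are made up, and the breaks are taken from the groups listed above:

```r
# Made-up ages
matage <- c(23, 30, 31, 35, 36, 43)

# Intervals are open on the left and closed on the right by default,
# so 30 falls in (20,30] and 35 in (30,35]
agegrp <- cut(matage, breaks = c(20, 30, 35, 40, 45))

levels(agegrp)
table(agegrp)
```

R writes the interval ]20-30] as (20,30]. Values outside the breaks become NA.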
Creating new variables: RowSums
Often we want to form new variables from other variables.
For example we might want to calculate a total score from some subscores.
We can sum variables using rowSums. Related functions are: rowMeans, colSums, colMeans.
Notice the effect of the option na.rm:
na.rm = FALSE: If we take a row sum where one of the values is missing then the row sum is set to missing.
na.rm = TRUE: Missing values are ignored and the sum of the non-missing values is calculated.
rowSums, rowMeans, colSums and colMeans give the same results as the corresponding apply/sapply calls, but are faster; e.g. for a data frame x with numeric columns, colMeans(x) is the same as sapply(x, mean). sapply can be used with many other functions.
Creating new variables: RowSums
# WANT TO MAKE A NEW VARIABLE SUMMING PRETERM, LOWBW AND HYP
Births.tab$score
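A minimal sketch of the effect of na.rm in rowSums(); the data frame df, its values and the names score_strict and score_avail are made up for illustration:

```r
# A small made-up data frame with one missing value
df <- data.frame(preterm = c(0, NA, 1),
                 lowbw   = c(0, 0, 1),
                 hyp     = c(1, 0, 0))
vars <- c("preterm", "lowbw", "hyp")

# Default na.rm = FALSE: a row with a missing value gets a missing sum
df$score_strict <- rowSums(df[, vars])

# na.rm = TRUE: missing values are ignored
df$score_avail <- rowSums(df[, vars], na.rm = TRUE)

df
```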
Split Data: Subset
Sometimes we may need to split our data.
In the Births data we may need to split the data into boys and girls.
We can use the subset() function and assign the new data sets to separate R objects.
Notice == (logical expression). We are not assigning a value to "sex", but asking whether "sex" is equal to 1.
Births.Male
Subset
Often data sets come with a lot of variables and we only want to use a few.
The function subset() can also be used to select the variables we want.
Notice the select option. This is needed to say that we want a subset of columns (on the previous slide it was rows).
Notice that we do not need quotes in select.
# SELECT 3 VARIABLES
Births.new
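A minimal sketch of both uses of subset(), rows by a logical condition and columns by select; the data frame df, its values and the object names males, small and both are made up for illustration:

```r
# A small made-up data frame
df <- data.frame(id = 1:4,
                 sex = c(1, 2, 1, 2),
                 bweight = c(3270, 2974, 3751, 2620),
                 matage = c(30, 34, 31, 35))

# Rows: keep the observations where sex is equal to 1
males <- subset(df, sex == 1)

# Columns: keep only the listed variables (no quotes needed)
small <- subset(df, select = c(id, bweight))

# Both at once
both <- subset(df, sex == 2, select = c(bweight, matage))
both
```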
Aggregating data
Sometimes we want to make a new data frame as a summary of the original data frame on the basis of factor levels.
Below we want to make a new data frame with the mean birth weight for combinations of preterm and sex.
PreSex
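One way to build such a summary is the aggregate() function with a formula; this is a sketch under the assumption that aggregate() is an acceptable tool here, and the data frame df, its values and the name agg are made up:

```r
# A small made-up data frame
df <- data.frame(preterm = c(0, 0, 1, 1, 0, 1),
                 sex     = c(1, 2, 1, 2, 1, 2),
                 bweight = c(3300, 3100, 2400, 2200, 3500, 2000))

# Mean of bweight for each combination of preterm and sex
agg <- aggregate(bweight ~ preterm + sex, data = df, FUN = mean)
agg
```

The result is a data frame with one row per combination that occurs in the data.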
Add rows: rbind
Suppose that data are collected for subgroups of subjects and saved in separate objects.
The separate objects are appended (stacked) to create a single object.
This will give an error message if the number of columns differs.
# APPEND
Births.Both
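A minimal sketch of stacking two data frames with rbind(); males, females and both are made-up objects with the same variables:

```r
# Two made-up data frames with identical variable names
males   <- data.frame(id = 1:2, sex = 1, bweight = c(3270, 3751))
females <- data.frame(id = 3:4, sex = 2, bweight = c(2974, 2620))

# Stack the rows of the second below the rows of the first
both <- rbind(males, females)
both
```

For data frames the columns are matched by name, so the variable names must agree as well as the number of columns.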
Add variables: merge
Often you have data in several data sets and want to combine the data sets by merging, using one or more variables as key variables. This adds variables to a master data set.
Person data: Id, age, sex, race
Answers to questionnaire: Id, q1,…,q10
Merged data (person data and answers): Id, age, sex, race, q1,…,q10
Merge
We have two data sets with a key variable "id": one with background information and one with blood pressure measurements.
agesex
4 Different Merges
In the merge function we will look at 4 of the options.
The call has the form merge(x, y, by = "key variable", ...), where ... is one of all = FALSE, all = TRUE, all.x = TRUE or all.y = TRUE.
Here x and y are data frames.
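A minimal sketch of the four merges. The object names follow the slide titles, but the contents of agesex and bp are made up, so the results will not match the deck's data:

```r
# Made-up background data and blood pressure data with key variable id
agesex <- data.frame(id  = c(1, 2, 3, 4),
                     age = c(30, 41, 25, 52),
                     sex = c("F", "M", "F", "M"))
bp     <- data.frame(id = c(2, 3, 4, 5),
                     bp = c(120, 135, 128, 140))

# all = FALSE (the default): only ids present in both data sets
merge_small <- merge(agesex, bp, by = "id")

# all = TRUE: ids present in either data set, NA where information is missing
merge_large <- merge(agesex, bp, by = "id", all = TRUE)

# all.x = TRUE: every id from x (agesex), whether or not it has a match in y
merge_x <- merge(agesex, bp, by = "id", all.x = TRUE)

# all.y = TRUE: every id from y (bp), whether or not it has a match in x
merge_y <- merge(agesex, bp, by = "id", all.y = TRUE)

merge_large
```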
Merging all=FALSE
merge_small
Merging all=TRUE
merge_large
Merging all.x=TRUE
merge_x
Merging all.y=TRUE
merge_y
Counting the Missing Observations: The is.na() and sum() functions
Suppose that we want to count the number of missing observations.
The function is.na returns a logical vector that is TRUE when a value is missing and FALSE otherwise.
is.na(merge_y$sex)
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# COUNT MISSING FOR ONE VARIABLE
sum(is.na(merge_y$sex))
[1] 1
# COUNT FOR DATA FRAME
colSums(is.na(merge_y))
   id   age   sex visit    bp
    0     2     1     0     0
Saving your work
Saving your script
Saving your workspace
Always save your script - do it often if you work in RStudio.
Reasons for saving your workspace:
Extensive data creations will be there next time you open your workspace.
Objects created ’on the fly’ (not in your script) will be there.
Reasons for not saving your workspace:
With a well-written script, you can recreate your analysis in seconds, unless you work with huge amounts of data.
Edited and saved data where the edits have been forgotten may cause havoc on your results.
Left-over objects created for various purposes may enter your calculations unintentionally due to the structure of R’s search path.
Saving your work
How to save your work:
Script: Click on the script and press ’save’ in RStudio and the plain R GUI.
Workspace: Click on the command prompt and press ’save’. Alternatively, use the save.image() function.
Both: Accept when asked after terminating RStudio or the plain R GUI.
Graphics
Visualizing Data
Whenever we want to analyze data, the first thing we do is to have a look at it.
How are the observations spread out? What are the most common values? Are there any unusual observations? Are there any relationships between variables? Etc.
The graphics section will not tell you all about graphics in R, but it will get you going.
R Graphics Systems
base: The original/default graphics system in R. Highly customizable; but complex plots require much code.
Example:
demo(graphics)
lattice: Shorter syntax for complex (e.g. multipanel) plots. Less customizable than base.
Example:
library(lattice)
demo(lattice)
ggplot2: By Hadley Wickham; builds on the same ideas as lattice. gg = "grammar of graphics".
Example:
library(ggplot2)
example(qplot)
Graphics Histogram
A Basic Histogram
Common way to examine the distribution of a continuous variable.
The range of the variable is by default divided into equal-width intervals (bins). Plots the number of observations in each bin (unless you specify otherwise).
hist(Births$bweight)
[Figure: histogram with heading "Histogram of Births$bweight"; x-axis "Births$bweight", y-axis "Frequency"]
Note that R automatically creates axis labels and a heading.
Histogram with a few options
To modify axis labels we set the options xlab and ylab.
The heading is set in the option main.
hist(Births$bweight, xlab = "Birth weight (g)", main = "Histogram of Birth Weight")
[Figure: histogram with heading "Histogram of birth weight"; x-axis "Birth weight (g)", y-axis "Frequency"]
Histogram with more options
We could type ?hist to find more options to customize the histogram.
The available colours are coded as numbers, or one can write col = "red".
If we want shading we can try the density option.
The orientation of the numbers on the axes is set by the option las.
hist(Births$bweight, las = 1, main = "Histogram of birth weight", col = 2, density = 7)
[Figure: shaded histogram with heading "Histogram of birth weight"; x-axis "Births$bweight", y-axis "Frequency"]
How to get your plot from RStudio
Graphics Box plot
A Basic Box Plot
Box plots show a measure of the location (the median line), the spread of the distribution (the length of the box and whiskers), and skewness (asymmetry in the upper and lower parts of the box and in the whisker lengths).
We use the function boxplot(variable). Adding labels to the axes and colours is done as for hist.
Histograms and a Box Plot
A Basic Box Plot
When describing data we can even add the observations to the plot.
Notice that the function rug shows the observations.
boxplot(Births$bweight, xlab = "Birth weight (g)", horizontal = TRUE, col = 6)
rug(Births$bweight)
[Figure: horizontal box plot of birth weight with a rug of the observations; x-axis "Birth weight (g)"]
Box Plot for Groups
A very useful feature is that we can make box plots for different groups next to each other for comparison. Notice the option data = Births.
# BOX PLOT FOR BOYS AND GIRLS
boxplot(bweight ~ sexalph, data = Births, las = 1,
        ylab = "Birth weight (g)", col = 2:3)
[Figure: box plots of birth weight for the groups "female" and "male"; y-axis "Birth weight (g)"]
Box Plot for Groups
Set our own axis. Notice xaxt = "n".
# BOX PLOT WHERE WE WANT TO MAKE OUR OWN AXIS
boxplot(bweight ~ sexalph, data = Births, las = 1,
        ylab = "Birth weight (g)", col = c("red", "blue"), xaxt = "n")
axis(1, at = c(1,2), labels = c('Girl', 'Boy'))
[Figure: box plots of birth weight by group with x-axis labels "Girl" and "Boy"; y-axis "Birth weight (g)"]
Graphics Scatter Plot
The Basic Scatter Plot
The scatter plot is the standard graph for examining the relationship between two continuous variables.
The plot(x,y) function is used to create scatter plots, where (x,y) are the points we want to plot.
We will look at the relationship between car weight (lbs/1000) and miles per gallon for 32 cars.
plot(mtcars$wt, mtcars$mpg)
lines(sort(mtcars$wt), 37.285-5.344*sort(mtcars$wt), type="l")
[Figure: scatter plot of mtcars$mpg against mtcars$wt with a straight line added]
The Scatter Plot
We can customize the scatter plot similar to before.
The function abline adds a straight line to the plot.
When we write abline(lm(mpg ~ wt)) we get the best fitting line.
plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1, pch = 19)
abline(lm(mtcars$mpg ~ mtcars$wt), lty = 1, col = 3)
[Figure: scatter plot with the fitted line; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon"]
abline
The function abline can also add reference lines to a plot.
A horizontal line, e.g. at 25 and 30: abline(h = c(25, 30))
A vertical line, e.g. at 2 and 5: abline(v = c(2, 5))
plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1, pch = 19)
abline(h = c(25, 30), col = c("red", "magenta"), lty = 2)
abline(v = c(2, 5), col = 4:5, lty = 3:4)
[Figure: scatter plot with horizontal and vertical reference lines; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon"]
Add a smoothed line
Perhaps we do not think the association is linear and try a nonparametric smoothed line.
plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1)
abline(lm(mtcars$mpg ~ mtcars$wt), lty = 2, col = 4)
lines(lowess(mtcars$wt, mtcars$mpg), lty = 1, col = 2)
[Figure: scatter plot with the fitted straight line and the lowess smooth; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon"]
Enhanced graph procedures: Scatter plot example from the "car" package
library(car)
scatterplot(mpg ~ wt | cyl, data = mtcars, ylim = c(0,40),
            xlab = "Car weight (lbs/1000)",
            ylab = "Miles per gallon", las = 1,
            legend.plot = TRUE,
            legend.coords = "topright",
            id.method = "identify",
            labels = row.names(mtcars),
            boxplots = "xy")
Here we want to plot miles per gallon versus weight for cars that have 4, 6 or 8 cylinders.
We write this as mpg ~ wt | cyl.
By default we get different colours for groups and both a linear and a smoothed line.
A legend is included in the top right corner of the plot.
The option id.method = "identify" means that points can be identified by mouse clicks.
Box plots of miles per gallon and weight are included ("xy" option for both axes).
More possibilities: ?scatterplot.
[Figure: the resulting scatter plot]
Graphics Line plot
A Line Plot
Connecting points in a scatter plot from left to right. Here the growth of a tree. Notice the option type = "b", meaning points joined by lines.
plot(TreeA$age, TreeA$circumference, type = "b", xlab = "Age (days)",
     ylab = "Circumference (mm)", las = 1)
[Figure: line plot of circumference against age; x-axis "Age (days)", y-axis "Circumference (mm)"]
Difference between plot() and lines() functions
We have seen both the plot and the lines functions.
The plot function creates a new graph. It is a high-level plotting function.
The lines function adds information to an existing graph but it cannot produce its own graph. It is a low-level plotting function.
A high-level plotting function can (often) be converted to a low-level plotting function with the option add = TRUE.
Usually lines will be used after a high-level plotting function (such as plot) has produced a graph.
A line plot and a legend
plot(TreeA$age, TreeA$circumference, type = "b", lty = 1,
     xlab = "Age (days)", ylab = "Circumference (mm)", las = 1, col = 2)
lines(TreeB$age, TreeB$circumference, type = "b", col = 3, lty = 2)
legend(locator(1), # we will place it with a mouse click
       legend = c("A","B"), title = "Tree", lty = 1:2, col = 2:3)
Layout of several plots on one graph
Several plots on one graph:
Use par(mfrow = c(2, 2)) to get a 2 x 2 grid of plots, and par(mfrow = c(1, 1)) to go back to one plot. For other options, check the layout() function.
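A minimal sketch of a 2 x 2 layout using the built-in mtcars data; the choice of the four plots and the names old and layout_now are only for illustration:

```r
# Switch to a 2 x 2 grid; par() returns the previous settings so they can be restored
old <- par(mfrow = c(2, 2))
layout_now <- par("mfrow")   # the layout now in effect

# Each high-level plotting call fills the next cell, row by row
hist(mtcars$mpg, main = "Histogram", xlab = "Miles per gallon")
boxplot(mtcars$mpg, main = "Box plot")
plot(mtcars$wt, mtcars$mpg, main = "Scatter plot")
plot(mtcars$mpg, type = "b", main = "Line plot")

# Back to one plot per graph
par(old)
```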
Linear Models
Linear models
Statistical models of a linear relationship between variables:
$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, \ldots, n.$
• Dependent variable: $Y$.
• Independent variable: $X$.
• Stochastic term/error term: $\varepsilon$.
The $\varepsilon_i$'s should be a) stochastically independent, and b) identically normally distributed, with mean 0 and variance $\sigma^2 > 0$.
• Model parameters: $\alpha$, $\beta$ and $\sigma^2$.
Linear models: Example
$Y = 1 + 0.5X + \varepsilon$
> plot(X,Y,xlab="X",ylab="Y")
> lines(sort(X),1+0.5*sort(X),lwd=3,col="red")
[Figure: scatter plot of Y against X with the true line 1 + 0.5X drawn in red]
Linear models: Example
Model residuals: The random/stochastic term.
> residuals.Y 0.
f
Fitting linear models: The lm() function
$Y = \alpha + \beta X + \varepsilon$
In R, linear models can be fitted to data with the lm() function:
> analysis <- lm(Y ~ X)
> analysis

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
     0.9702       0.5155

$\hat{\alpha}$ is the estimated intercept, 0.97, while $\hat{\beta}$ is the estimated coefficient of X, 0.52.
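The data behind this example are not shown on the slides. The sketch below simulates comparable data from the model Y = 1 + 0.5X + ε and fits it with lm(); the seed, the sample size, the range of X and the error standard deviation are all assumptions, so the estimates will differ slightly from those printed above.

```r
set.seed(1)                              # arbitrary seed, for reproducibility only
n <- 100                                 # assumed sample size
X <- runif(n, min = 0, max = 5)          # assumed range of X
Y <- 1 + 0.5 * X + rnorm(n, sd = 0.7)    # true alpha = 1, beta = 0.5; sd assumed

analysis <- lm(Y ~ X)
coef(analysis)                           # estimates should lie close to 1 and 0.5
```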
Model formulas
The argument to lm() is a formula object.
• A linear model is specified by a formula object, which e.g. may look like this:
> my.formula fit fit fit
The lm object: Model diagnostics
> analysis
The lm object: Contents
• An lm object is a list, and contains a lot of information. See the contents with the names() function:
> analysis <- lm(Y ~ X)
> names(analysis)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"
[10] "call"          "terms"         "model"
• Access the contents with the $ operator; e.g.
> analysis$coef
(Intercept)           X
  0.9701906   0.5154684
• Some of the 12 components of the list are lists themselves. Find more information by applying str().
The lm object: Summaries
The summary() function may be applied to lm objects as well:
> analysis <- lm(Y ~ X)
> summary(analysis)

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max
-1.61297 -0.40132  0.07808  0.55124  1.32380

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.97019    0.09182  10.566  < 2e-16 ***
X            0.51547    0.05410   9.527 1.29e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6861 on 98 degrees of freedom
Multiple R-squared:  0.4808,  Adjusted R-squared:  0.4755
F-statistic: 90.77 on 1 and 98 DF,  p-value: 1.286e-15
The lm object: Summaries
The summary is an R list object itself, with sub-elements that can be accessed:
> analysis <- lm(Y ~ X)
> names(summary(analysis))
 [1] "call"          "terms"         "residuals"
 [4] "coefficients"  "aliased"       "sigma"
 [7] "df"            "r.squared"     "adj.r.squared"
[10] "fstatistic"    "cov.unscaled"
We can find the estimate $\hat{\sigma}^2$ of $\sigma^2$ as
> summary(analysis)$sigma^2
[1] 0.4707802
Modeling nonlinear relations with lm()
> plot(X,Y)
> lines(sort(X),predict(lm(Y~X))[order(X)])
[Figure: scatter plot of Y against X with the fitted straight line; the points follow a curve]
The relationship between Y and X is not linear. How to proceed with lm()?
Modeling nonlinear relations with lm()
• The I-operator in formulas:
> analysis plot(X,Y)
> lines(sort(X),predict(analysis)[order(X)],type="l" )
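[Figure: scatter plot of Y against X with the fitted curve]
'Linear' in lm() means linear in the parameters: a curved relationship can still be fitted, provided the 'right' (suitably transformed) independent variables are used.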
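The formula used on this slide was lost in extraction. As a hedged illustration of the idea, the sketch below assumes a quadratic relationship and fits it by regressing Y on the transformed variable I(X^2); the simulated data and all numeric constants are assumptions.

```r
set.seed(2)                                # arbitrary seed
X <- runif(100, min = 0, max = 5)
Y <- 2 + 3 * X^2 + rnorm(100, sd = 5)      # assumed quadratic relationship

# Linear in the parameters, quadratic in X:
# I() makes ^ mean ordinary arithmetic inside the formula
analysis <- lm(Y ~ I(X^2))

plot(X, Y)
lines(sort(X), predict(analysis)[order(X)], type = "l")
coef(analysis)                             # should lie close to 2 and 3
```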
Extraction functions
• Some important extraction functions for obtaining information:
coef()          Estimated model parameters
confint()       Confidence intervals for estimated model parameters
residuals()     Raw residuals
rstandard()     Standardized residuals
model.matrix()  The design matrix
predict()       Predictions from model
vcov()          Covariance matrix for estimated model parameters
anova()         Anova test table for model reduction
drop1()         Test for dropping one term from model
summary()       A summary printout, and access to summary statistics
• Statistical tests: drop1() is usually the function to use.
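A short tour of these functions on one fitted model. The built-in cars data set and the object name fit are assumptions chosen for the illustration; any lm object would do.

```r
fit <- lm(dist ~ speed, data = cars)   # cars: built-in data set, illustration only

coef(fit)                    # estimated intercept and slope
confint(fit, level = 0.95)   # confidence intervals for the parameters
head(residuals(fit))         # raw residuals
head(rstandard(fit))         # standardized residuals
head(model.matrix(fit))      # design matrix: intercept column and speed
vcov(fit)                    # covariance matrix of the estimates
predict(fit, newdata = data.frame(speed = c(10, 20)))  # predictions at new speeds
drop1(fit, test = "F")       # F test for removing speed from the model
```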
Factors and interactions
A dataset on Sex, Age, and a response Y:
> summary(my.data)
     Sex          Age              Y
 Female:50   Min.   :18.70   Min.   : 3.091
 Male  :50   1st Qu.:36.38   1st Qu.: 9.274
             Median :51.12   Median :12.430
             Mean   :49.99   Mean   :12.711
             3rd Qu.:63.22   3rd Qu.:15.972
             Max.   :77.31   Max.   :21.800
plot(my.data$Age, my.data$Y,xlab="",ylab="Y",col=my.data$Sex)
legend(20,20,c("Female","Male"),col=1:2,pch=1)
[Figure: Y against Age, coloured by Sex (Female, Male)]
Factors and interactions
Model: Interaction between Sex and Age. Testing the interaction term with drop1():
> analysis <- lm(Y ~ Age + Sex + Age:Sex, data = my.data)
> drop1(analysis,test="F")
Single term deletions

Model:
Y ~ Age + Sex + Age:Sex
        Df Sum of Sq    RSS    AIC F value    Pr(>F)
<none>               102.77 10.737
Age:Sex  1    33.131 135.91 36.679  30.947 2.391e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction is significant and cannot be removed from the model.
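The data set my.data is not available here, so the following sketch repeats the same steps on the built-in ToothGrowth data (a numeric dose and a two-level factor supp). The data set and the object names are assumptions made for the illustration.

```r
# ToothGrowth: built-in data with a numeric dose and a two-level factor supp
fit <- lm(len ~ dose + supp + dose:supp, data = ToothGrowth)

# drop1() respects marginality: only the interaction can be dropped here
drop1(fit, test = "F")

# The same test, written as a comparison of two nested models
fit_additive <- lm(len ~ dose + supp, data = ToothGrowth)
anova(fit_additive, fit)
```

Both tables report the same F statistic for the interaction, since each compares the model with and without the dose:supp term.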
Factors and interactions
> summary(analysis)

Call:
lm(formula = Y ~ Age + Sex + Age:Sex, data = my.data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.60300 -0.53551  0.00317  0.59830  2.43544

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.696993   0.452879   3.747 0.000305 ***
Age          0.187921   0.009045  20.777  < 2e-16 ***
SexMale     -0.525623   0.677086  -0.776 0.439479
Age:SexMale  0.071599   0.012871   5.563 2.39e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.035 on 96 degrees of freedom
Multiple R-squared:  0.945,  Adjusted R-squared:  0.9433
F-statistic: 550.2 on 3 and 96 DF,  p-value: < 2.2e-16
• R uses the first level of the Sex variable (Female) as the reference level, so SexMale is the difference in intercept for males; similarly, Age:SexMale is the difference in slope.
Factors and interactions
> my.data2 with(my.data2,{
+ plot(Age,Y,xlab="Age",ylab="Y",col=Sex)
+ lines(Age[Sex=="Female"],predicted[Sex=="Female"],col=1,type="l")
+ lines(Age[Sex=="Male"],predicted[Sex=="Male"],col=2,type="l")
+ legend(20,20,c("Female","Male"),col=1:2,pch=1)
+ })
[Figure: Y against Age, coloured by Sex (Female, Male), with the fitted line for each sex]
More on formula objects
• Model formulae are symbolic. We have seen the use of '+' and ':', and adding a 0 or -1.
• The product '*' crosses variables: it expands to main effects and interactions:
y ~ x*z
corresponds to
y ~ x + z + x:z
• Powers expand effects to the specified order:
y ~ (x+z+w)^2
corresponds to
y ~ x + z + w + x:z + x:w + z:w
• The subtraction operator '-' removes terms if possible:
y ~ (x+z+w)^2 - x:z - a:b
corresponds to
y ~ x + z + w + x:w + z:w
More on formula objects
The I() function overrides the symbolic interpretation, and invokes the usual arithmetic instead.
Observe that
y ~ (x*z)^2 specifies the same model as y ~ (x+z)^2
But
y ~ I((x*z)^2) and y ~ I((x+z)^2)
are two different model formulas; regressing y on $x^2 z^2$ and $x^2 + z^2 + 2xz$, respectively.
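One way to check these expansions is to ask R for the term labels of a formula. The helper term_labels() below is a hypothetical convenience wrapper around terms(), introduced only for this sketch.

```r
# Term labels after R has expanded the symbolic operators
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ x * z)                # "x" "z" "x:z"
term_labels(y ~ (x + z + w)^2)        # main effects plus all two-way interactions
term_labels(y ~ (x + z + w)^2 - x:z)  # the same, with x:z removed
term_labels(y ~ (x * z)^2)            # same terms as y ~ (x + z)^2
term_labels(y ~ I((x + z)^2))         # one arithmetic term, not expanded
```

No data are needed for this: terms() works on the symbols alone, which is exactly what "symbolic" means here.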
Formulas when transforming data into normality
• Sometimes it is possible to transform data such that they match a linear model.
• For instance, if the variance is increasing with the mean:
[Figure: two panels of residuals against predictions, "Raw data" and "Log transformed"]
• A log transformation is often appropriate in this case.
• This may be done directly in a formula object. For example:
log(y) ~ log(x) + log(z)
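A sketch of the situation in the two residual plots, under the assumption of a power-law relationship with multiplicative error. The simulated data, the constants 2, 1.5 and 0.3, and the object names are all assumptions.

```r
set.seed(3)                                   # arbitrary seed
x <- runif(200, min = 1, max = 10)
# Assumed multiplicative error: the spread of y grows with its mean
y <- 2 * x^1.5 * exp(rnorm(200, sd = 0.3))

fit_raw <- lm(y ~ x)             # residual spread increases with the prediction
fit_log <- lm(log(y) ~ log(x))   # transformation written directly in the formula

old_par <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), residuals(fit_raw), main = "Raw data",
     xlab = "Prediction", ylab = "Residual")
plot(fitted(fit_log), residuals(fit_log), main = "Log transformed",
     xlab = "Prediction", ylab = "Residual")
par(old_par)

coef(fit_log)                    # the slope estimates the exponent 1.5
```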
Generalized linear models - the glm() function
• Some types of observations can never be transformed into normality.
• Example: binary data; ones and zeroes.
• For a wide class of distributions, the so-called exponential families, we can use generalized linear models:
  • Formulate linear models for a transformation of the mean value.
  • No transformation of observations, thereby preserving their distributional properties.
• Allows easy modeling in R with the glm() function, nearly identical to lm().
• Standard example: Logistic regression.
GLM vs GLM
General linear models        Generalized linear models
Normal distribution          Exponential dispersion family
Mean value linear            Function of mean value linear
Independent observations     Independent observations
Same variance                Variance function of mean
lm() easy to apply           glm() almost as easy to apply
Types of response variables
i    Count data ($y_1 = 57, \ldots, y_n = 59$ accidents) - Poisson distribution.
ii   Binary response variables ($y_1 = 0, y_2 = 1, \ldots, y_n = 0$), or fractions of counts ($y_1 = 15/297, \ldots, y_n = 144/285$) - Binomial distribution.
iii  Count data, waiting times - Negative Binomial distribution.
iv   Multiple ordered categories "Unsatisfied", "Neutral", "Satisfied" - Multinomial distribution.
v    Count data, multiple categories - Multinomial distribution.
vi   Continuous responses, constant variance ($y_1 = 2.567, \ldots, y_n = 2.422$) - Normal distribution.
vii  Continuous positive responses with constant coefficient of variation - Gamma distribution.
Logistic regression example
In a study of developmental toxicity of a chemical compound, a specifiedamount of an ether was dosed daily to pregnant mice, and after 10 days allfetuses were examined. The size of each litter and the number of stillbornswere recorded:
Index   Number of        Number of      Fraction         Concentration
        stillborn, z_i   fetuses, n_i   stillborn, y_i   [mg/kg/day], x_i
1        15              297            0.0505             0.0
2        17              242            0.0702            62.5
3        22              312            0.0705           125.0
4        38              299            0.1271           250.0
5       144              285            0.5053           500.0
Table: Results of a dose-response experiment on pregnant mice. Number of stillborn fetuses found for various dose levels of a toxic agent.
Reported in Price et al. (1987).
Logistic regression example
Let $Z_i$ denote the number of stillborns at dose concentration $x_i$.
We shall assume $Z_i \sim B(n_i, p_i)$, that is, a binomial distribution corresponding to $n_i$ independent trials (fetuses), with the probability of stillbirth, $p_i$, being the same for all $n_i$ fetuses.
We want to model $Y_i = Z_i / n_i$. In particular, we will look for a model for $E[Y_i] = p_i$.
Logistic regression example
• A natural quantity to consider is the odds, $p/(1-p)$, which varies on $(0, \infty)$; this is more natural than $(0, 1)$, where $p$ varies.
• Since effects on the odds are often multiplicative, we take the log to convert the effects to additive form.
• We arrive at the logit function:
$\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right).$
For this model, the logit function is our link function. We will formulate a linear model for the mean values transformed with the link function:
$\eta_i = \mathrm{logit}(p_i), \quad i = 1, \ldots, 5.$
The linear model is
$\eta_i = \alpha + \beta x_i, \quad i = 1, \ldots, 5.$
• The inverse transformation, which gives the probabilities $p_i$ of stillbirth, is the so-called logistic function:
$p_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}, \quad i = 1, \ldots, 5.$
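The link function and its inverse can be written out directly in R. The function names logit() and logistic() and the example probabilities are choices made for this sketch; base R provides the same pair as qlogis() and plogis().

```r
logit    <- function(p)   log(p / (1 - p))            # link function
logistic <- function(eta) exp(eta) / (1 + exp(eta))   # inverse link

p <- c(0.05, 0.25, 0.50, 0.90)   # example probabilities
eta <- logit(p)
eta             # 0 at p = 0.5, negative below, positive above
logistic(eta)   # recovers p

# Base R equivalents
all.equal(eta, qlogis(p))
all.equal(logistic(eta), plogis(eta))
```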
Logistic regression example
> mice mice$resp mice.glm
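The assignments on this slide did not survive extraction. The sketch below is one possible way to set up the fit from the dose-response table, using the same model formula, family and link as the Call printed in the summary; the column names stillb and total are assumptions, while mice, resp, conc and mice.glm follow the names visible on the slides.

```r
# Data from the dose-response table; the column names stillb and total are assumptions
mice <- data.frame(
  stillb = c(15, 17, 22, 38, 144),        # number of stillborn, z_i
  total  = c(297, 242, 312, 299, 285),    # number of fetuses, n_i
  conc   = c(0, 62.5, 125, 250, 500)      # concentration, x_i
)

# Two-column response: stillborn (successes) and live fetuses (failures)
mice$resp <- cbind(mice$stillb, mice$total - mice$stillb)

mice.glm <- glm(resp ~ conc, family = binomial(link = logit), data = mice)
coef(mice.glm)      # estimates of alpha and beta on the logit scale

fitted(mice.glm)    # fitted stillbirth probabilities at the five dose levels
```

For binomial data glm() accepts the response as a two-column matrix of successes and failures, which keeps the litter sizes in the model without a separate weights argument.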
Logistic regression example
> summary(mice.glm)
Call:
glm(formula = resp ~ conc, family = binomial(link = logit), data = mice)
Deviance Residuals:
1 2 3 4 5
1.1317 1.0174 -0.5968 -1.6464 0.6284
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.2479337 0.1576602 -20.6
Logistic regression example
The linear predictor, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$:
[Figure: logit(stillborn fraction) against concentration]
Figure: Logit transformed observations and corresponding linear predictions for dose response assay.
Logistic regression example
Predicted stillborn fractions, $\hat{p}_i = \exp(\hat{y}_i)/(1 + \exp(\hat{y}_i))$:
[Figure: stillborn fraction against concentration]
Figure: Observed stillborn fractions and corresponding fitted values under logistic regression for dose response assay.
Specification of a generalized linear model in glm()
> mice.glm
ggplot2
• Basic plotting function: ggplot(). Used for advanced plots.
• Wrapper that resembles plot() from the basic graphics system: qplot(). Used for 'quick' plots. Syntax resembles that of plot().
• Grammar of graphics:
  • All plots are objects. You build them incrementally. Use the operator + to add to an existing plot.
  • Layer: Aesthetics (aes): Defines how the data are mapped.
  • Layer: Geometric objects (geom): Points, lines, polygons, etc.
  • Layer: Coordinate system objects (coord).
Example: Diamond data
• Load the ggplot2 package and take a look at the diamonds data:
> library(ggplot2)
> head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
• Quick plot
> qplot(carat, price, data=diamonds)
[Figure: scatter plot of price against carat]
Modifying the quick plot
• With qplot(), it is easy to work with:
  color   Color each point according to a variable in the dataset, and add a corresponding legend.
  log     Log-transform one or both axes.
  facets  Split in a multi-panel plot according to a group variable.
  main    Add a title.
• Let's try modifying the previous plot by adding, one by one:
  • color = cut
  • log = "xy"
  • facets = ~ clarity
  • main = "Diamonds"
Modified plot
> qplot(carat,
+ price,
+ data=diamonds,
+ color = cut,
+ log="xy",
+ facets=~clarity,
+ main="Diamonds")
[Figure: "Diamonds": price against carat on log-log axes, one panel per clarity level (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), points coloured by cut (Fair, Good, Very Good, Premium, Ideal)]
Incremental plot construction
• qplot is good for a start. However, in order to take full advantage of ggplot2, we must know what the plot is built of and how to modify the parts.
• The quick plot qplot(carat, price, data=diamonds) can be built incrementally:
• Define an empty plot object:
> p p p p
• We can use the '+' operator to modify the plot 'p'. Let's see some examples in the following:
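The code for the incremental steps was lost in extraction. The following is one possible way to build the same scatter plot step by step; it is a sketch and not necessarily the sequence shown on the original slide.

```r
library(ggplot2)

p <- ggplot(diamonds)          # plot object holding only the data
p <- p + aes(carat, price)     # map carat to the x-axis and price to the y-axis
p <- p + geom_point()          # add a layer of points
p                              # printing the object draws the plot
```

Because p is an ordinary R object, it can be stored and extended later, which is what the following slides do with p + geom_density2d(), p + coord_flip() and p + facet_grid().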
Change the plot type (geom)
• Get an overview of possible geoms at http://docs.ggplot2.org.
• You can also look at the examples in the documentation:
> example(geom_boxplot)
> example(geom_polygon)
> example(geom_raster)
• Example: add a 2D density on top:
> p + geom_density2d()
[Figure: scatter plot of price against carat with 2D density contours on top]
Change the coordinate transformations
> p + coord_flip()
[Figure: the plot with flipped axes, carat against price]
> p + coord_polar()
[Figure: the plot in polar coordinates]
Change to multipanel display
• Add a facet grid to split the plot in multiple panels.
• A facet grid takes a formula as input.
• Example:
> p + facet_grid(. ~ cut)
[Figure: price against carat, one column of panels per cut (Fair, Good, Very Good, Premium, Ideal)]
Change to multipanel display - other facets
> p + facet_grid(cut ~ .)
[Figure: price against carat, one row of panels per cut (Fair, Good, Very Good, Premium, Ideal)]
Change to multipanel display - other facets
> p + facet_grid(cut ~ color)
[Figure: price against carat in a grid of panels, rows by cut (Fair to Ideal) and columns by color (D to J)]
Density plot and alpha blending
• Attributes (colour, shape, fill, linetype etc.) automatically become grouping variables.
• Note the specification of transparency through the alpha argument: alpha blending.
> ggplot(diamonds) +
+ aes(price, fill=cut) +
+ geom_density(alpha=.3)
[Figure: overlapping semi-transparent density curves of price, filled by cut (Fair, Good, Very Good, Premium, Ideal)]
Application: Maps
• The ggmap package: Interfacing ggplot2 and RGoogleMaps.
• Two steps in making a map with ggmap:
  1. download raster data for the map;
  2. create the map with ggmap(), and overlay it with layers of geoms etc.
• Downloading raster data: Specify a) the location of the center; b) the zoom factor.
• Location specification for map downloads: Two ways.
  1. location/address:
> myLocation myLocation
A first map: The London Olympic Stadium
• Download data with the get_map() function, plot with ggmap():
> mapData1 ggmap(mapData1,extent = "panel",ylab = "Latitude",xlab = "Longitude")
[Figure: map of the area around the London Olympic Stadium; axes lon and lat]
The London Olympic Stadium - same but different
• Different map type:
> mapData ggmap(mapData,extent = "panel",ylab = "Latitude",xlab = "Longitude")
[Figure: the same area drawn with a different map type; axes lon and lat]
The London Olympic Stadium - same but hybrid
• Different map type:
> mapData ggmap(mapData,extent = "panel",ylab = "Latitude",xlab = "Longitude")
[Figure: the same area drawn as a hybrid map; axes lon and lat]
Overlaying maps
• Geographic coordinates obtained with the geocode() function:
> geocode("University of Washington")
        lon      lat
1 -106.4407 31.76788
• A map of the USA: Let's overlay this map with data.
> usa_center USA USA
[Figure: map of the USA; axes lon and lat]
Fatal vehicle accidents in the USA 2012
• mv_collisions data:
> head(mv_collisions)
        state collisions
1     Alabama         78
2     Arizona        145
3    Arkansas         46
4  California        722
5    Colorado         77
6 Connecticut         40
• Getting the geocoordinates with geocode():
> for (i in 1:nrow(mv_collisions)) {
+   latlon = geocode(mv_collisions$state[i])
+   mv_collisions$lon[i] = as.numeric(latlon[1])
+   mv_collisions$lat[i] = as.numeric(latlon[2])
+ }
• Getting the map:
> usa_center = geocode("United States")
> USA
Fatal vehicle accidents in the USA 2012
• Overlaying the data:
> circle_scale USA + geom_point(aes(x=lon, y=lat), data=mv_collisions, col="red",
+ alpha=0.4, size=mv_collisions$collisions*circle_scale)
[Figure: map of the USA with a red semi-transparent circle per state, sized by the number of collisions; axes lon and lat]
Credits
• Original coding of the fatal motor vehicle collision example: Sean Lorenz.
• ggmap: D. Kahle and H. Wickham (2013): ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Where to go?
R possibilities are endless:
• R shiny – web applications
• dplyr – data management
• RODBC – Reading from SQL databases etc.
• TwitteR – text analytics of tweets
• GoogleAnalyticsR – Google search analytics
• Data Science with R on the Edx platform – Online course by yours truly…
• Or just practice… and check t.test()…