R Workshop (16/06/2017 Durham University)
Preliminaries
Aim: The aim is to familiarise yourself with some of the basic concepts and tools of the statistical computing environment R and the IDE RStudio.
R software: R and RStudio are freely available across all platforms.
To download R, visit:
(Windows) https://cran.r-project.org/bin/windows/base/
(Linux) https://cran.r-project.org/bin/linux/
(Mac OSX) https://cran.r-project.org/bin/macosx/
After installing R, install RStudio:
(Windows) https://download1.rstudio.org/RStudio-1.0.143.exe
(Windows no admin access, R required) https://download1.rstudio.org/RStudio-1.0.143.zip
(Ubuntu 12.04+/Debian 6+ 32-bit) https://download1.rstudio.org/rstudio-1.0.143-i386.deb
(Ubuntu 12.04+/Debian 6+ 64-bit) https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
(Fedora 19+/RedHat7+/openSUSE 13.1+ 32-bit) https://download1.rstudio.org/rstudio-1.0.143-i686.rpm
(Fedora 19+/RedHat7+/openSUSE 13.1+ 64-bit) https://download1.rstudio.org/rstudio-1.0.143-x86_64.rpm
(Mac OSX) https://download1.rstudio.org/RStudio-1.0.143.dmg
The RStudio default environment
The default RStudio environment is divided into 4 panels: (1) Source, (2) Console, (3) Workspace and History, and (4) Files, Plots, Packages and Help, as shown in the figure below:
Source: When you first start RStudio, panel 1 might not be displayed until you open a .R script or create a new one. Use the menu at the top or navigate using panel 4 to open an existing script. To create a new file, use the top menu File → New... → R Script. You may also choose to start a project instead.
Console: If you have used R before, this is the console you are familiar with. Try the following exercises to get used to the console, and the communication between panels 1 and 2.
Exercise 1: (Using the console)
Step 1: Click on Panel 2; you should see your cursor blinking in front of one of the arrows > .
Step 2: Type in 45.23 - 3.59 and press Enter.
You should see the following in your console (check panel 2).
In [1]: 45.23 - 3.59
Exercise 2: (Using a script to send commands to the console)
Step 1: Create a new script (File → New... → R Script) or open an existing one.
Step 2: Click on Panel 1 and type in 45.23 - 3.59.
Step 3: Make sure that the cursor is on the line you just typed and press Ctrl+Enter.
You should see exactly the same as in Exercise 1.
41.64
Exercise 3: (Using the workspace)
When you create new objects in R, they get stored in your workspace. In RStudio, you can see your current workspace in Panel 3. You can also load previously saved workspaces from this panel.
Either using a script or the console, run the command
a = 45.23 - 3.59
You should see the following in your console:
In [2]: a = 45.23 - 3.59
This time the result isn't displayed in the console as in Exercises 1 and 2; instead, the value is stored in an object called a. This object will now appear in your workspace.
Note: The command a = 45.23 - 3.59 is equivalent to a <- 45.23 - 3.59.
You can also use the command ls() to list all variables currently in your workspace. The command cat allows you to display a variable in the console.
In [3]: ls()
        cat("a = ",a)
To delete a, type in rm(a).
In [4]: rm(a)
Exercise 4: (Commenting)
To create a comment, you must start the commented text with a hash symbol (#). In RStudio, commented text appears in green. Type
#45.23 - 3.59
in the console in panel 2 and press Enter. Unlike in Exercise 1, R will ignore your command.
In [5]: #45.23 - 3.59
Using R as an ordinary calculator
At the > prompt in the console (or in a script), you can type numerical expressions as you would into a calculator, hit Enter (or press Ctrl+Enter from a script), and R will print the answer. Try it yourself by using R to compute the following operations:
Sum of 45.23 and 3.59
Difference of 45.23 and 3.59
Out [3]: 'a'
         a = 41.64
In [6]: 45.23 + 3.59
In [7]: 45.23 - 3.59
Multiplication is represented by *, division by / , and powers can be obtained using ^ . More complicated mathematical operations are represented by functions (e.g. exp, sqrt, log, sin).
In [8]: 45.23 * 3.59
In [9]: 45.23/3.59
In [10]: 45.23^3.59
In [11]: log(3.59)
To open the help file for an object or function, type ? followed by the object's name. For example:
In [12]: ?log
This will display the help file for the log function in the package base, entitled Logarithms and Exponentials. If you don't know the name of the object you are looking for, you may use the command help.search. Try running the command: help.search("logarithm").
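As a small illustration of the kind of detail a help file documents, the sketch below uses the optional base argument of log. The input values are chosen only for illustration.

```r
# The natural logarithm (base e) is the default
log(exp(1))          # 1

# The base argument changes the base of the logarithm
log(100, base = 10)  # 2
log10(100)           # same result via the dedicated base-10 function
log2(8)              # 3

# exp is the inverse of the natural logarithm
exp(log(3.59))
```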
Vectors and Objects
In R, objects can be anything from a scalar to a function or a data frame. In Exercise 3 above, we created the object a (and later deleted it). Let's recreate this object and assign the value 2 to it.
In [13]: a = 2
         a
The object a is a vector of length 1. To create a longer vector, we use the c function. For example, to create the vector (1, 2, 5), we write:
In [14]: c(1, 2, 5)
Out [6]: 48.82
Out [7]: 41.64
Out [8]: 162.3757
Out [9]: 12.5988857938719
Out [10]: 876963.362664197
Out [11]: 1.27815220250019
Out [13]: 2
Out [14]: 1 2 5
Scalar operations
Most scalar operations can be applied to a vector elementwise. For example:
In [15]: 2*c(1,2,5)
In [16]: c(1,2,5)/2
In [17]: log(c(1,2,5))
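The same elementwise behaviour applies when both operands are vectors. The following is a minimal sketch; the vectors x and y are made up for illustration.

```r
x = c(1, 2, 5)
y = c(10, 20, 30)

x + y      # 11 22 35
x * y      # 10 40 150
x^2        # 1 4 25
sqrt(x)    # functions are applied to each element in turn

# A single number is recycled to match the length of the vector
x + 1      # 2 3 6
```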
Creating sequences
In order to create sequences, we use the seq function. The most commonly used arguments for seq are: from, to, by, and length.out. For example, to create the sequence of numbers 1 to 10, we write:
In [18]: seq(from = 1, to = 10, by = 1)
         seq(from = 1, to = 10) # equivalent - by = 1 is default
         seq(1, 10) # shorthand
         1:10 #very shorthand
To create a vector with all odd numbers between 1 and 10, we write:
In [19]: seq(from = 1, to = 10, by = 2)
         seq(1, 10, 2) #shorthand
Finally, to create a vector of evenly spaced numbers in a given interval, we use length.out. Try changing the output length below.
In [20]: seq(from = 1, to = 10,length.out = 5)
         seq(1, 10, length.out = 5) #shorthand
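The arguments by and length.out describe the same sequence in two ways: with n points between from and to, the step is (to - from) / (n - 1). A short sketch with illustrative values:

```r
# Five evenly spaced points on [0, 1]; the step is (1 - 0) / (5 - 1) = 0.25
s1 = seq(from = 0, to = 1, length.out = 5)
s1                      # 0.00 0.25 0.50 0.75 1.00

# The same sequence written with an explicit step
s2 = seq(from = 0, to = 1, by = 0.25)
all.equal(s1, s2)       # TRUE

# Sequences can also count down with a negative step
seq(10, 1, by = -3)     # 10 7 4 1
```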
Exercise 5: Create the vector (2, 4, 6) and call it v1.
Out [15]: 2 4 10
Out [16]: 0.5 1 2.5
Out [17]: 0 0.693147180559945 1.6094379124341
Out [18]: 1 2 3 4 5 6 7 8 9 10
          1 2 3 4 5 6 7 8 9 10
          1 2 3 4 5 6 7 8 9 10
          1 2 3 4 5 6 7 8 9 10
Out [19]: 1 3 5 7 9
          1 3 5 7 9
Out [20]: 1 3.25 5.5 7.75 10
          1 3.25 5.5 7.75 10
In [22]:
Exercise 6: Create the same vector using the object a and without typing the numbers 4 or 6.
In [ ]:
Exercise 7: Create a vector containing all integers from 1 to 100 and call it v2.
In [23]:
Concatenating and subsetting
You can also concatenate multiple vectors using the command c. For example:
In [24]: c(a, v1, v2)
To retrieve an element from a vector, we use single square brackets, '[]', as follows:
In [25]: v1[2] # second element from vector v1
To retrieve a subset of elements, we can feed a vector of indices within the square brackets. For example, to retrieve the first 5 elements of v2 we can do any of the following:
In [26]: v2[c(1, 2, 3, 4, 5)]
         v2[seq(1, 5)]
         v2[1:5]
We can also use subsetting to modify elements within a vector:
In [27]: v1[2] = 10
         v1
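Square brackets also accept negative indices (to drop elements) and logical vectors (to keep elements that satisfy a condition). The sketch below uses a made-up vector w rather than the exercise vectors.

```r
w = c(12, 7, 3, 25, 8)   # illustrative values

w[-1]          # drop the first element: 7 3 25 8
w[-c(1, 2)]    # drop the first two elements: 3 25 8

w > 7          # logical vector: TRUE FALSE FALSE TRUE TRUE
w[w > 7]       # keep only elements greater than 7: 12 25 8

w[w > 7] = 0   # subsetting on the left-hand side modifies those elements
w              # 0 7 3 0 0
```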
Vectors of strings and object types
Out [24]: 2 2 4 6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Out [25]: 4
Out [26]: 1 2 3 4 5
          1 2 3 4 5
          1 2 3 4 5
Out [27]: 2 10 6
Exercise 8: Create a vector containing your forename, middle names, and surname as strings. Hint: A string is simply an ordered collection of characters surrounded by quotation marks, 'someCharacters' (either single quotes ' or double quotes " can be used).
In [ ]:
Now try concatenating the vector a and the character 'a' as follows:
In [28]: c(a,'a')
We can check the type of an object using the function typeof. For example:
In [29]: typeof(2)
In [30]: typeof(a)
In [31]: typeof('a')
In [32]: typeof(c(a,v1,v2))
In [33]: typeof(c(a,'a'))
Here we see that concatenating a 'double' object and a 'character' object results in a vector of 'characters'. This can often be an undesirable effect when both data types need to be kept. We can convert the 'character' vector back to numeric using the as.numeric function:
In [34]: as.numeric(c(a,'a'))
Note that a warning was created when trying to convert the character 'a', and 'a' was replaced by an NA (not available, or missing value). However, the vector generated by as.numeric(c(a,'a')) is now of type 'double' instead of 'character'.
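After a conversion like this, it is often useful to locate the entries that became NA. The sketch below is one way to do so; the vector x is made up, and suppressWarnings is used only because the warning is expected here.

```r
x = c("2", "a", "3.5")                 # illustrative character vector
num = suppressWarnings(as.numeric(x))  # silence the expected coercion warning

num              # 2.0  NA 3.5
is.na(num)       # FALSE TRUE FALSE
x[is.na(num)]    # the entries that could not be converted: "a"

mean(num)                # NA, because missing values propagate
mean(num, na.rm = TRUE)  # 2.75, ignoring the NA
```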
Lists
Out [28]: '2' 'a'
Out [29]: 'double'
Out [30]: 'double'
Out [31]: 'character'
Out [32]: 'double'
Out [33]: 'character'
Out [34]: Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"
          2 NA
A list is a vector containing other objects of one or more types. Let's start by creating objects of multiple types:
In [35]: v = c(1, 0.5, pi) #vector of numbers - double
         st = c('a', 'b', 'cat', 'R') #vector of strings - character
         lg = c(TRUE, FALSE, FALSE, TRUE, TRUE) #vector of logical
         lstx = list(v, st, lg)
In [36]: lstx
To retrieve a list slice, we use single square brackets, '[ ]', the same way we did when subsetting vectors:
In [37]: lstx[1]
         typeof(lstx[1])
To retrieve the vector within the list, we use double square brackets, '[[ ]]':
In [38]: lstx[[1]]
         typeof(lstx[[1]])
Similar to vectors, we can modify the contents of a list directly:
In [39]: lstx[[1]][2] = 200
         lstx
Sublists in a list can be named and addressed by their corresponding names. Let's recreate lstx:
In [40]: lstx = list(v = v, st = st, lg = lg)
         lstx
Out [36]: 1. 1 0.5 3.14159265358979
          2. 'a' 'b' 'cat' 'R'
          3. TRUE FALSE FALSE TRUE TRUE
Out [37]: 1. 1 0.5 3.14159265358979
          'list'
Out [38]: 1 0.5 3.14159265358979
          'double'
Out [39]: 1. 1 200 3.14159265358979
          2. 'a' 'b' 'cat' 'R'
          3. TRUE FALSE FALSE TRUE TRUE
Out [40]: $v
          1 0.5 3.14159265358979
          $st
          'a' 'b' 'cat' 'R'
          $lg
          TRUE FALSE FALSE TRUE TRUE
We can identify the names of the variables in a list by using the function names. In this example, we type:
In [41]: names(lstx)
Now we can retrieve members of lstx in two different ways:
In [42]: lstx[[1]]
         lstx$v
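The difference between '[ ]', '[[ ]]' and $ can be summarised in a short sketch. The list person and its contents are hypothetical, chosen only for illustration.

```r
person = list(name = "Ada", scores = c(90, 85))  # hypothetical named list

class(person["scores"])     # "list": single brackets return a sub-list
class(person[["scores"]])   # "numeric": double brackets return the element itself
person$scores[2]            # 85

# Assigning to a new name adds an element; assigning NULL removes one
person$age = 36
person$name = NULL
names(person)               # "scores" "age"
```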
Dataframes
Datasets in R are often stored in special types of lists called dataframes (similar to tables in Excel and Matlab). Dataframes are matrix-like blocks of values where each column represents a variable, and each row represents a case. Unlike generic lists, each sublist must contain the same number of elements. Here is a simple way of creating a dataframe:
In [43]: df = data.frame(Hospital=c("Darlington Memorial","James Cook","South Tyne"), Beds=c(463,1010,394))
In [44]: df
Note: In RStudio, you can view your dataframe in a more friendly format by clicking on it in your workspace, or by typing View(df) in the console.
We can identify the names of the variables in a dataframe by using the function names. In this example, we type:
In [45]: names(df)
We can reference each variable by its name or by its column number as follows:
In [46]: df$Beds
         df[,2]
Out [41]: 'v' 'st' 'lg'
Out [42]: 1 0.5 3.14159265358979
          1 0.5 3.14159265358979
Out [44]: Hospital             Beds
          Darlington Memorial   463
          James Cook           1010
          South Tyne            394
Out [45]: 'Hospital' 'Beds'
Out [46]: 463 1010 394
          463 1010 394
To retrieve a specific element, say the number of beds in South Tyne, we can do either of the following:
In [47]: df[3,2] #3rd element from column 2
         df$Beds[3] #3rd element from the Beds variable
Subsetting dataframes
The function subset can be used to easily filter dataframes. For example, to create a dataframe containing only the James Cook hospital entry, we can type:
In [48]: subset(df,Hospital == "James Cook")
Note: To read more about logical operators go to: https://www.r-bloggers.com/logical-operators-in-r/
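The most common logical operators can be combined inside subset. The sketch below uses a made-up dataframe called wards (not part of the workshop data) so that it can be run on its own.

```r
wards = data.frame(Ward = c("A", "B", "C", "D"),
                   Beds = c(12, 30, 24, 8),
                   Open = c(TRUE, TRUE, FALSE, TRUE))   # made-up data

subset(wards, Beds > 10 & Open)            # & means "and": wards A and B
subset(wards, Beds < 10 | !Open)           # | means "or", ! means "not": wards C and D
subset(wards, Ward != "A", select = Beds)  # != means "not equal"; select keeps chosen columns

# The same filter as the first line, written with square brackets
wards[wards$Beds > 10 & wards$Open, ]
```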
Exercise 9: Create a subset of df containing hospitals with 1000 beds or fewer.
In [ ]:
We can add new columns to a dataframe as follows:
In [49]: df$Postcode = c("DL3 6HX","TS4 3BW","NE34 0PL")
         df
Adding a new row is slightly more cumbersome. We do so by concatenating two dataframes using the row binding command rbind.
Out [47]: 394
          394
Out [48]:   Hospital    Beds
          2 James Cook  1010
Out [49]: Hospital             Beds  Postcode
          Darlington Memorial   463  DL3 6HX
          James Cook           1010  TS4 3BW
          South Tyne            394  NE34 0PL
In [50]: df = rbind(df,data.frame(Hospital="Friarage",Beds=178,Postcode="DL6 1JG"))
         df
Note: there are a number of libraries designed to facilitate data manipulation in R (e.g. dplyr, reshape2). For the examples presented so far, we are using only base functions.
Importing and exporting data
Data can be imported and exported from R in a number of formats. Standard delimited formats such as .csv can be imported using base functions, while more advanced formats such as JSON and XML can be imported using dedicated libraries. Such libraries also exist to import files produced in SPSS, SAS, and other commercial statistical software. In this section, we will focus on importing and exporting to and from .csv and .xlsx.
Let's start by importing the dataset IMD_housing_data_partial.csv. This table contains basic information about population distribution and some basic statistics for 112 towns and cities. We will use the command read.csv to import this table into the variable Census. Note that the first line of this csv only contains the table name; the table starts in line 2.
In [56]: Census = read.csv("IMD_housing_data_partial.csv", skip = 1, strip.white = TRUE) #Use View in RStudio or click on the variable to see the full data set
In [57]: names(Census)
First let's check the size of our dataframe using the function dim.
In [58]: dim(Census) # the dataset has 113 rows and 23 columns
Out [50]: Hospital             Beds  Postcode
          Darlington Memorial   463  DL3 6HX
          James Cook           1010  TS4 3BW
          South Tyne            394  NE34 0PL
          Friarage              178  DL6 1JG
Out [57]: 'TCITY15CD' 'Town.City' 'Region.Country' 'Population.aged.0.15' 'Population.aged.16.64' 'Population.aged.65.' 'Population.aged.85.' 'Population..limited.a.lot..by.a.health.problem.or.disability..aged.16.64' 'Population..limited.a.little..by.a.health.problem.or.disability..aged.16.64' 'Population..not.limited..by.a.health.problem.or.disability..aged.16.64' 'Households.owned' 'Households.privately.renting' 'Households.socially.renting' 'Other.households' 'Proportion.of.Full.Time.Students..aged.16.74.' 'Proportion.of.resident.population.with.no.qualifications..aged.16.' 'Proportion.of.resident.population.with.Level.four.and.above.qualifications..aged.16..' 'Proportion.of.workday.population.in.Manufacturing..C..Industry..aged.16.74.' 'Proportion.of.workday.population.in.Wholesale.and.Retail.Trade..G..Industry..aged.16.74.' 'Proportion.of.workday.population.in.Professional..Finance.and.Information..J.K.M..Industry..aged.16.74.' 'Proportion.of.workday.population.in.Public.Admin..Health.and.Education..O.P.Q..Industry..aged.16.74.' 'Net.in.commuting..aged.16.74' 'Net.in.commuting.residents.in.employment..aged.16.74'
Out [58]: 113 23
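The import step, and the matching export step with write.csv, can be tried without the workshop file. The sketch below writes a small made-up file to a temporary location (the town names and numbers are invented), reads it back while skipping the title line, and exports it again.

```r
# Write a small made-up file whose first line is a title, like the workshop file
path = tempfile(fileext = ".csv")
writeLines(c("Table 1: example data",
             "Town,Population",
             "Alpha , 48000",
             "Beta, 93000"), path)

towns = read.csv(path, skip = 1, strip.white = TRUE)  # skip the title line
dim(towns)     # 2 rows, 2 columns
names(towns)   # "Town" "Population"

# Export back to .csv; row.names = FALSE avoids an extra column of row numbers
out = tempfile(fileext = ".csv")
write.csv(towns, out, row.names = FALSE)
readLines(out)
```

Here strip.white = TRUE removes the stray spaces around "Alpha", which would otherwise become part of the value.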
The headers in a dataframe can be modified by replacing elements in the vector returned by names, here names(Census). For example:
In [59]: names(Census)[4] = "Pop.aged.0_15"
         names(Census)[5:7] = c("pop.aged.16_64","pop.aged.65plus","pop.aged.85plus")
         names(Census)[15:21] = c("Prop.FT.students.16_74","Prop.no.qual.16_74","Prop.lev4plus.16_74","Prop.manuf.16_74",
                                  "Prop.retail.16_74","Prop.professional.16_74","Prop.health.and.edu.16_74")
         names(Census)
Now we will remove the columns we are not interested in at the moment, in this case, columns 8 to 14, 22, and 23.
In [60]: Census[,8:14] = NULL
         Census$Net.in.commuting..aged.16.74 = NULL
         Census = Census[,-ncol(Census)]
         dim(Census)
         names(Census)
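The three ways of removing columns used above (assigning NULL, dropping by name, and negative indices) can be compared on a toy dataframe. The dataframe d and its column names are made up for this sketch.

```r
d = data.frame(id = 1:3, height = c(2, 4, 6),
               group = c("x", "y", "z"), flag = c(TRUE, FALSE, TRUE))  # toy data

d$flag = NULL                                        # remove one column by name
d = d[, !(names(d) %in% c("height")), drop = FALSE]  # remove every column whose name is listed
names(d)                                             # "id" "group"

d[, -ncol(d), drop = FALSE]                          # negative index: everything except the last column
```

Dropping by name is usually more robust than dropping by position, since positions shift whenever another column is removed.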
Installing and loading libraries
Now let's import a .xlsx containing the full dataset in the file IMD_housing_data_full.xlsx. Here we have tables extracted from the ONS article "Towns and cities analysis, England and Wales, March 2016" with population breakdowns by age, town/city, IMD, housing prices and other details.
There are a number of libraries that can be used to import .xlsx files; here we will use openxlsx. R libraries and packages can be installed using the install.packages command. For example, to install openxlsx we type:
install.packages("openxlsx")
and then load by typing:
In [62]: library("openxlsx")
Out [59]: 'TCITY15CD' 'Town.City' 'Region.Country' 'Pop.aged.0_15' 'pop.aged.16_64' 'pop.aged.65plus' 'pop.aged.85plus' 'Population..limited.a.lot..by.a.health.problem.or.disability..aged.16.64' 'Population..limited.a.little..by.a.health.problem.or.disability..aged.16.64' 'Population..not.limited..by.a.health.problem.or.disability..aged.16.64' 'Households.owned' 'Households.privately.renting' 'Households.socially.renting' 'Other.households' 'Prop.FT.students.16_74' 'Prop.no.qual.16_74' 'Prop.lev4plus.16_74' 'Prop.manuf.16_74' 'Prop.retail.16_74' 'Prop.professional.16_74' 'Prop.health.and.edu.16_74' 'Net.in.commuting..aged.16.74' 'Net.in.commuting.residents.in.employment..aged.16.74'
Out [60]: 113 14
          'TCITY15CD' 'Town.City' 'Region.Country' 'Pop.aged.0_15' 'pop.aged.16_64' 'pop.aged.65plus' 'pop.aged.85plus' 'Prop.FT.students.16_74' 'Prop.no.qual.16_74' 'Prop.lev4plus.16_74' 'Prop.manuf.16_74' 'Prop.retail.16_74' 'Prop.professional.16_74' 'Prop.health.and.edu.16_74'
In [63]: wb = loadWorkbook("IMD_housing_data_full.xlsx")
         lsdf = lapply(1:length(wb$sheet_names),function(i){read.xlsx(wb,sheet = i,startRow = 2)})
         names(lsdf) = wb$sheet_names
To merge two tables, we can use the function merge. Here we first merge the tables containing the Census data and the Housing data, then merge the result with the IMD data. We will use the variable TCITY15CD as the unique identifier for merging.
In [83]: temp = merge(lsdf$Census,lsdf$Housing, by = "TCITY15CD", all = TRUE)
         data.demo = merge(temp, lsdf$IMD, by = "TCITY15CD", all = TRUE)
         # Renaming columns and removing unnecessary columns as seen before
         names(data.demo)[4:7] = c("Pop.aged.0_15","pop.aged.16_64","pop.aged.65plus","pop.aged.85plus")
         names(data.demo)[15:21] = c("Prop.FT.students.16_74","Prop.no.qual.16_74","Prop.lev4plus.16_74","Prop.manuf.16_74",
                                     "Prop.retail.16_74","Prop.professional.16_74","Prop.health.and.edu.16_74")
         names(data.demo)[c(27,32,46,52,56,44)] = c("median.house.price2015","median.house.price1995","IMD.prop.most.deprived",
                                                    "health.prop.most.deprived","crime.prop.most.deprived","no.lsaos")
         names(data.demo)[2:3] = c("Town.City","Region.Country")
         data.demo = data.demo[,c(1:5,15:21,27,32,44,46,52,56)]
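The effect of all = TRUE in merge is easiest to see on two small tables that share only some identifiers. The tables left and right below are made up for this sketch.

```r
left  = data.frame(code = c("T1", "T2", "T3"), pop   = c(10, 20, 30))      # made-up tables
right = data.frame(code = c("T2", "T3", "T4"), price = c(150, 175, 210))

merge(left, right, by = "code")                      # default: only codes present in both (T2, T3)
both = merge(left, right, by = "code", all = TRUE)   # all = TRUE: every code, gaps filled with NA
both
sum(!complete.cases(both))                           # 2 rows contain a missing value
```

Rows with NA produced this way are the ones that complete.cases later filters out before model fitting.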
Summary and descriptive statistics
R has a large range of tools for calculating summary and descriptive statistics. The most frequently used functions such as mean, median, variance, quantiles/percentiles, and range are included in the base and stats libraries. More specialized functions can be found in other libraries (or you can write your own).
Mean: mean(x)
Variance: var(x)
Standard Deviation: sd(x)
Minimum and Maximum (Range): range(x)
Quantiles/Percentiles: quantile(x, vec) where vec is a vector of quantiles/percentiles in (0, 1).
Median: median(x)
Let's test them using the variable Pop.aged.0_15 in the data.demo dataframe. Check the help file for the stats package by typing ?stats.
In [84]: mean(data.demo$Pop.aged.0_15) # Mean
         var(data.demo$Pop.aged.0_15) # Variance
         sd(data.demo$Pop.aged.0_15) # Standard deviation
         range(data.demo$Pop.aged.0_15) # Range
         quantile(data.demo$Pop.aged.0_15,c(0.05,0.95)) # 5% and 95% quantiles
         median(data.demo$Pop.aged.0_15) # Median
We can also use the summary function to retrieve basic information about dataframes, variables, and models. As we will see later, this is a fairly versatile function that can be applied to most object types.
In [ ]: summary(data.demo)
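To see what the functions listed above compute, the sketch below reproduces the mean, variance and standard deviation by hand on a small made-up vector and compares the results with the built-in functions.

```r
x = c(2, 4, 4, 5, 7, 9, 11)          # illustrative values

n = length(x)
m = sum(x) / n                        # mean by hand
v = sum((x - m)^2) / (n - 1)          # sample variance divides by n - 1

all.equal(m, mean(x))                 # TRUE
all.equal(v, var(x))                  # TRUE
all.equal(sqrt(v), sd(x))             # TRUE

quantile(x, c(0.25, 0.5, 0.75))       # the 50% quantile equals median(x)
diff(range(x))                        # width of the range: 9
```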
Linear Regression Models
We will be building linear models to predict the median house prices in 2015, using the census, deprivation and housing data we have uploaded.
To start, we fit a simple linear model, so that just one independent (or predictor) variable is used. In this case we model median house price using the proportion of adults with a level 4 plus qualification.
In [86]: lm_lev4plus = lm(median.house.price2015 ~ Prop.lev4plus.16_74,
                          data=data.demo[complete.cases(data.demo),]) #complete.cases removes rows with missing data
We can inspect the linear model object lm_lev4plus using the summary function as follows:
In [ ]: summary(lm_lev4plus)
Out [84]: 19.2625744737809
          4.36007670940509
          2.08807967027245
          15.1380488005596 26.1656494886742
                      5%              95%
          16.04016906062 23.2136786041414
          19.3079434412125
The Estimate column of the Coefficients table shows us the value of the intercept and of the coefficient of Prop.lev4plus.16_74.
So our model is:
$$\text{median.house.price}_{2015} = -25065.6 + 8125.8 \times \text{Prop.lev4plus}_{16-74}$$
The Pr(>|t|) column shows the p-value of each coefficient under the assumptions of the linear model. The smaller this number, the more statistically significant the coefficient.
The Adjusted R-squared value towards the bottom is an estimate of the proportion of variation in the dependent variable that is explained by our model.
We can plot the fitted values from the model against the true 2015 median house price, and add a line with intercept 0, slope 1 to assess the quality of our fit.
In [92]: plot(fitted(lm_lev4plus) ~ lm_lev4plus$model$median.house.price2015, pch=20, col = 'blue') #produce simple scatterplot
         abline(a=0, b=1, col = 'darkgray') #add line
We can also look at the residuals:
In [106]: par(mfrow=c(1,2)) # split your plotting area - alternatively use layout
          plot(resid(lm_lev4plus), pch = 20, col = 'cadetblue')
          abline(h=0, col = 'darkslateblue')
          hist(resid(lm_lev4plus), col = 'cornflowerblue')
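The R-squared values reported by summary can be reproduced from the residuals. The sketch below is not the workshop model: it uses the cars dataset that ships with R as a stand-in, so the numbers differ, but the calculation is the same.

```r
fit = lm(dist ~ speed, data = cars)        # cars ships with R; used here only as a stand-in

y   = cars$dist
rss = sum(resid(fit)^2)                    # residual sum of squares
tss = sum((y - mean(y))^2)                 # total sum of squares
n   = length(y)
p   = 1                                    # number of predictors

r2     = 1 - rss / tss
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))

all.equal(r2, summary(fit)$r.squared)          # TRUE
all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
all.equal(fitted(fit) + resid(fit), y, check.attributes = FALSE)  # fitted + residual = observed
```

The adjusted version penalises the number of predictors, which is why it is the more useful figure when comparing models of different sizes.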
We'll build a second simple linear model, this time using the median house price in 1995 to predict the median house price in 2015.
In [107]: lm_med95 = lm(median.house.price2015 ~ median.house.price1995, data=data.demo[complete.cases(data.demo),])
          summary(lm_med95)
Call:
lm(formula = median.house.price2015 ~ median.house.price1995,
    data = data.demo[complete.cases(data.demo), ])

Residuals:
   Min     1Q Median     3Q    Max
-78936 -10848  -1006  10936 112877

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -8.650e+04  9.871e+03  -8.763 3.16e-14 ***
median.house.price1995  5.372e+00  1.994e-01  26.946  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24250 on 107 degrees of freedom
Multiple R-squared: 0.8716, Adjusted R-squared: 0.8704
F-statistic: 726.1 on 1 and 107 DF, p-value: < 2.2e-16
In [108]: plot(fitted(lm_med95) ~ lm_med95$model$median.house.price2015, pch = 22, col = 'darkred')
          abline(a=0, b=1, col = 'coral')
We can compare the two models by plotting in the same window:
In [128]: par(mfrow=c(2,3))
          plot(fitted(lm_lev4plus) ~ lm_lev4plus$model$median.house.price2015, col = rgb(0,0.5,0.5))
          abline(a=0, b=1, col = rgb(0,0.9,0.5))
          plot(resid(lm_lev4plus), col = rgb(0.5,0,0))
          abline(h=0, col = rgb(1,0,0))
          hist(resid(lm_lev4plus), col = rgb(0.4,0.4,1))
In [133]: par(mfrow=c(2,3))
          plot(fitted(lm_med95) ~ lm_med95$model$median.house.price2015, col = rgb(0,0.5,0.5))
          abline(a=0, b=1, col = rgb(0,0.9,0.5))
          plot(resid(lm_med95), col = rgb(0.5,0,0))
          abline(h=0, col = rgb(1,0,0))
          hist(resid(lm_med95), col = rgb(0.4,0.4,1))
We can build a model using both independent variables by altering the formula:
In [ ]: lm_both = lm(median.house.price2015 ~ median.house.price1995 + Prop.lev4plus.16_74,
                     data=data.demo[complete.cases(data.demo),])
        summary(lm_both)
Using the commands we have used for our previous linear models, we find that our model is
$$\text{median.house.price}_{2015} = -90040 + 4.17 \times \text{median.house.price}_{1995} + 2532 \times \text{Prop.lev4plus}_{16-74}$$
We can build a linear model including all terms using shorthand: a dot on the right-hand side of the formula stands for all remaining columns. In this case we remove the first two columns since they are row identifiers. We also specify first that the Region.Country variable should be a factor variable, rather than a string.
In [ ]: data.demo$Region.Country = as.factor(data.demo$Region.Country)
        data.demo = data.demo[,-6]
        lm_all = lm(median.house.price2015 ~ ., data=data.demo[complete.cases(data.demo), -c(1,2)])
        summary(lm_all)
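Converting Region.Country with as.factor matters because lm expands a factor into indicator columns measured against a baseline level, which by default is the first level in alphabetical order. A minimal sketch with a made-up factor called region:

```r
region = factor(c("North", "South", "East", "South", "North"))  # made-up factor
levels(region)                     # "East" "North" "South": alphabetical, so "East" is the baseline

colnames(model.matrix(~ region))   # "(Intercept)" "regionNorth" "regionSouth"

region2 = relevel(region, ref = "South")   # choose a different baseline
levels(region2)                    # "South" "East" "North"
```

Each indicator coefficient is then read as a shift relative to the baseline level, which has no column of its own.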
Looking at the summary shows that this model explains around 96% of the variation in median.house.price2015, with some independent variables proving more significant than others.
'East Midlands' is taken as the base case for the Region.Country variable, and a coefficient is added/subtracted for any other value. We can compare the relative importance of the independent variables using the package relaimpo.
In [137]: library("relaimpo")
In [ ]: calc.relimp(lm_all, type="lmg") #need to fix
The 'Relative importance metrics' table near the top of the output shows the relative importance of the terms in the model, by estimating how each term contributes to the R-squared value.
Using the function step, R can add and remove terms to improve the model. Starting with lm_all and removing terms:
In [ ]: lm_step_backward = step(lm_all, scope=list(lower= ~1), direction = "backward", k=log(109))
Setting k = log(109) ensures that step uses the Bayesian information criterion (BIC) to assess the models; 109 is the number of complete cases used to fit the model.
Adding and removing terms, allowing second order interactions:
In [ ]: lm_step_both = step(lm_step_backward, scope=list(lower = ~1, upper = ~.^2), direction="both", k=log(109))
We can use the same steps as before to inspect this model:
In [141]: summary(lm_step_both)
          plot(fitted(lm_step_both)~lm_step_both$model$median.house.price2015)
          hist(resid(lm_step_both))
          plot(resid(lm_step_both))
Call:
lm(formula = median.house.price2015 ~ Region.Country + Prop.no.qual.16_74 +
    Prop.lev4plus.16_74 + median.house.price1995 + Prop.no.qual.16_74:Prop.lev4plus.16_74,
    data = data.demo[complete.cases(data.demo), -c(1, 2)])

Residuals:
   Min     1Q Median     3Q    Max
-22962  -6652    103   6548  38178

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                             -1.056e+05  2.852e+04  -3.703 0.000355 ***
Region.CountryEast of England            3.769e+04  6.352e+03   5.934 4.68e-08 ***
Region.CountryLondon                     8.222e+04  1.556e+04   5.285 7.86e-07 ***
Region.CountryNorth East                -6.409e+03  6.865e+03  -0.934 0.352853
Region.CountryNorth West                -1.448e+04  5.836e+03  -2.481 0.014828 *
Region.CountrySouth East                 3.188e+04  6.557e+03   4.861 4.55e-06 ***
Region.CountrySouth West                 1.739e+04  6.826e+03   2.548 0.012426 *
Region.CountryWest Midlands             -6.512e+03  6.260e+03  -1.040 0.300800
Region.CountryYorkshire and The Humber  -1.037e+04  6.072e+03  -1.708 0.090936 .
Prop.no.qual.16_74                       3.708e+03  7.727e+02   4.798 5.86e-06 ***
Prop.lev4plus.16_74                      6.489e+03  6.757e+02   9.602 1.06e-15 ***
median.house.price1995                   2.599e+00  2.337e-01  11.122  < 2e-16 ***
Prop.no.qual.16_74:Prop.lev4plus.16_74  -1.850e+02  3.913e+01  -4.728 7.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12890 on 96 degrees of freedom
Multiple R-squared: 0.9674, Adjusted R-squared: 0.9634
F-statistic: 237.7 on 12 and 96 DF, p-value: < 2.2e-16
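On the penalty k = log(109) passed to step above: the argument k is the penalty charged per fitted parameter, with k = 2 giving AIC and k = log(n) giving BIC, where n is the number of observations. The sketch below checks this relationship with the built-in functions AIC and BIC, using the mtcars dataset that ships with R as a stand-in for the workshop data.

```r
fit = lm(mpg ~ wt + hp, data = mtcars)   # mtcars ships with R; a stand-in for the workshop data
n = nrow(mtcars)

AIC(fit)                 # default penalty k = 2 per parameter
AIC(fit, k = log(n))     # penalty k = log(n) per parameter
BIC(fit)                 # identical to the line above

all.equal(AIC(fit, k = log(n)), BIC(fit))   # TRUE
```

Because log(n) exceeds 2 once n is larger than 7, the BIC penalty favours smaller models than AIC does.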
Exercise 11: Split data.demo into two dataframes, setFit and setPred, containing roughly 75% and 25% of complete cases respectively.
Exercise 12: Fit a linear model to predict the median house price for 2015 using any number of predictors that you find appropriate, using only the setFit dataframe. Name your linear model predMed.
Exercise 13: Use the command predict (read its help file first) to estimate the median house prices for the localities in setPred. Compare your estimates to the real values observed.