A Practical Introduction to Quantitative Methods and R
Get Started with R and Hands on with Statistical Learning and Quantitative Modelling

Dr. Philippe J.S. De Brouwer

last compiled: October 1, 2018


Hammers work fine on nails, not so much on screws. So descriptive and inferential statistics can do a lot for us, but only understanding the practical implementation helps us to choose the right tool.


Contents

I An Introduction to statistics with R

1 Introduction

2 The Basics of R
2.1 Variables
2.2 Data Types
2.2.1 Basic Data Types
2.2.2 Vectors
2.2.3 Lists
2.2.4 Matrices
2.2.5 Arrays
2.2.6 Factors
2.2.7 Data Frames
2.3 Operators
2.3.1 Arithmetic Operators
2.3.2 Relational Operators
2.3.3 Logical Operators
2.3.4 Assignment Operators
2.3.5 Other Operators
2.3.6 Loops
2.3.7 Functions
2.3.8 Packages
2.3.9 Strings
2.4 Selected Data Interfaces
2.4.1 CSV Files
2.4.2 Excel Files
2.4.3 Databases
2.5 Charts & Graphs
2.5.1 Pie Charts
2.5.2 Bar Charts
2.5.3 Boxplots
2.5.4 Histograms
2.5.5 Scatterplots
2.5.6 Line Graphs
2.5.7 Plotting Functions
2.6 Distributions
2.6.1 Normal Distribution
2.6.2 Binomial Distribution
2.7 Time Series Analysis
2.7.1 Time Series in R
2.7.2 Forecasting
2.7.2.1 Moving Average
2.7.2.2 Seasonal Decomposition

3 Elements of Descriptive Statistics
3.1 Measures of Central Tendency
3.1.1 Mean
3.1.2 Median
3.1.3 Arithmetic Mode
3.2 Measures of Variation or Spread
3.3 Measures of Covariation
3.4 Chi Square Tests

4 Elements of Inferential Statistics
4.1 Regression Models
4.1.1 Linear Regression
4.1.2 Multiple Linear Regression
4.1.3 Logistic Regression
4.1.4 Poisson Regression
4.1.5 Non-Linear Regression
4.2 The Model Performance
4.2.1 Mean Square Error (MSE)
4.2.2 R-Squared
4.2.3 Mean Average Deviation (MAD)
4.2.4 The performance of binary classification models
4.2.4.1 AUC for logistic regression
4.2.5 AUC Gini for logistic regression
4.2.6 Kolmogorov-Smirnov (KS) for logistic regression
4.2.6.1 The package ROCR
4.3 Analysis of variance
4.4 Learning Machines
4.5 Decision Tree
4.5.1 Essential Background
4.5.2 Important considerations
4.5.3 Growing trees with R
4.5.4 Evaluating the performance of a decision tree
4.5.4.1 The performance of the regression tree
4.5.4.2 The performance of the classification tree
4.6 Random Forest
4.7 Artificial Neural Networks (ANN)
4.7.1 The basics of ANNs in R
4.7.2 An example of a work-flow to develop an ANN
4.8 Bootstrapping
4.9 Cross-Validation

5 Data Visualization Methods
5.1 Heat-maps
5.2 Text Mining
5.2.1 Word Clouds
5.2.2 Word Associations

6 Examples
6.1 Financial Analysis with QuantMod
6.1.1 The quantmod data structure
6.1.2 Support functions supplied by quantmod
6.1.3 Financial modeling in quantmod
6.2 Predicting Awards

II Appendices

7 Other Resources

8 Levels of Measurement
8.1 Nominal Scale
8.2 Ordinal Scale
8.3 Interval Scale
8.4 Ratio Scale

References

Bibliography

Index

Nomenclature


PART I

An Introduction to statistics with R


PART I .:. CHAPTER 1 .:.

Introduction

The world around us is constantly changing, and making the wrong decisions can be disastrous for a business. At the same time it is more important than ever to innovate. Innovating means having an idea and trying it, with the risk of failure.

There are many ways to come to a view on what to do; one of the most popular methods is instinct and prejudice, juiced with psychological biases. There are also numerous ways to come to a conclusion: some of the most popular methods include decision by authority ("let the boss decide"), deciding by decibels ("the loudest employee is heard") and dogmatism ("we did this in the past" or "we have a procedure that says this"). While these methods of forming an opinion and deciding might coincidentally work out, in general they are sub-optimal by design. Indeed, the best solution might not even be considered, or might be ruled out based on the wrong arguments.

Looking at scientific development throughout time as well as at human history, one is compelled to conclude that the only workable construct so far is what is known as the scientific method. No other method has brought the world so many innovations and so much progress.

The Scientific Method

Usually one recognizes Aristotle (384–322 BCE, Greece) as the father of the scientific method because of his rigorous logical method, which was much more than natural logic. But it is fair to credit Ibn al-Haytham (aka Alhazen — 965–1039, Iraq) with preparing the scientific method for collaborative use. His emphasis on collecting empirical data and on the reproducibility of results laid the foundation for a scientific method that is much more successful. This method allows people to check each other and to confirm or reject previous results. This is a lot stronger and more likely to lead to truly correct results.

However, both the scientific method and the word "scientist" only came into common use in the 19th century, and the scientific method only became the standard method in the 20th century. Therefore, it should not come as a surprise that this became a period of inventions and development as never seen before.

[Figure 1.1: The steps in the scientific method for the data scientist and mathematical modeller, aka "quant": formulate a question (hypothesis) → design a test plan → collect data (experiment) → wrangle data → model data → draw a conclusion → publish and communicate.]

While previous inventions such as fire, agriculture, the wheel, bronze and steel might not have followed the scientific method explicitly, they created a society ready to embrace the scientific method and to fuel an era of massively accelerated innovation and expansion. The internal combustion engine, electricity and magnetism fuelled the economic growth that came to an end in 1929. The electronic computer brought us to the 21st century, and now a new era of growth is being fuelled by big data, machine learning, nanotechnology and maybe quantum computing.

As scientists we have the responsibility to celebrate this tradition of the scientific method, to use it to answer interesting questions and, above all, to guard humanity against the misuse of technology.

Indeed, huge power comes with huge responsibility. Once an invention is made, it is impossible to "un-invent" it. Once the atomic bomb exists, it cannot be forgotten; it is part of our knowledge forever. What we can do is promote peaceful applications of quantum technology, such as sensors to open doors, diodes, computers, photosynthesis, etc.

At this point machine learning might help to bring us to the electronic singularity.

It is our responsibility to foresee such dangers as well as we can and to do all that is in our power to avoid an extinction event. Many inventions had a dark side and have led, for example, to more efficient ways of killing people, to degenerating the ozone layer, or to polluting our ecosystem. Humanity has had many difficult times and very dark days; however, never before did the whole of humanity go extinct. That would be the greatest disaster of all, with no recovery possible.

So, the scientific method is important, and while large organisations struggle to come up with new ideas and products, often the question is which part of the technology makes sense. Usually data is the answer.

Data, statistics and the scientific method are powerful tools. The company that has the best data and uses its data best is the company that will be the most adaptable to the changes and hence the one to survive. This is not biological evolution, but guided evolution.

The role of the data-analyst in any company cannot be overestimated.


PART I .:. CHAPTER 2 .:.

The Basics of R

In this book we will approach data and analytics from a practitioner's point of view, and our tool of choice is R. When the author started to use R, it was growing in popularity.

R is in some sense a copy of the S programming language (written in 1976 by John Chambers at Bell Labs) with added lexical scoping semantics. Usually, code written in S will also run in R.

R is a recent language: it was only in 1992 that the project was started by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The first version was available in 1995, and a stable version only in 2000.

Now the R Development Core Team (of which Chambers is a member) develops R further and maintains the code. For some years now, Microsoft has embraced the project and provides MRAN (the Microsoft R Application Network). This distribution is also FOSS and has some advantages over standard R, such as enhanced performance (e.g. multi-thread support, and the checkpoint package that makes results more reproducible).

R is . . .

• a programming language built for statistical analysis, graphics representation and reporting;

• an interpreted computer language which allows branching, looping and modular programming using functions, as well as object-oriented programming features.

The main features of R

• allows integration with procedures written in C, C++, .Net, Python or FORTRAN for efficiency;

• is freely available (under the GNU General Public License), and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac;

• is simple and effective;

• is free and open;

• has an effective data handling and storage facility;

• provides a suite of operators for calculations on arrays, lists, vectors and matrices;

• provides a large, coherent and integrated collection of tools for data analysis;

• provides graphical facilities for data analysis and display, either directly on screen or on paper;

• allows you to stand on the shoulders of giants (e.g. by using libraries);

• has a supportive on-line community.

R is the most widely used statistical programming language and is used everywhere from universities to business applications.

• You need a working installation of R on your computer.

• R is available for Mac, Linux and Windows from https://cran.r-project.org/

• To start R, open the command line and type R (followed by enter)

You should then get the command line prompt of R. It is of course also possible to use a graphical interface such as RStudio (see https://www.rstudio.com)

alternative

Use R online:

• https://www.tutorialspoint.com/execute_r_online.php

• http://www.r-fiddle.org


RStudio

Whether you use standard R or MRAN, using RStudio will enhance your performance and help you to be more productive. RStudio is an IDE for R and provides a console, an editor with syntax highlighting, a window to show plots and some workspace management.

Download it from https://www.rstudio.com/

Basic arithmetic

# addition
2 + 3
# product
2 * 3
# power
2**3
2^3
# logic
2 < 3
x <- c(1,3,4,3)
x.mean <- mean(x)
x.mean
y <- c(2,3,5,1)
x + y

Note that the white space is not important.

the scan() function

x <- scan()

invites you to type all the values of the vector one by one. To end, type enter without typing a number (i.e. leave one empty to end).

Batch mode

1. create a file test.R

2. add the content print("Hello World")

3. run the command line Rscript test.R

4. now, open R and run the command source("test.R")

5. add to the file:

my_function <- function(a, b) {
  a + b
}

6. now repeat step 4 and run my_function(4,5)


2.1 Variables

Valid Variables

Variables

• can contain letters as well as "_" (underscore) and "." (dot);

• must start with a letter (which can be preceded by a dot);

e.g. my_var.1 and my.Cvar. are valid, but _myVar, my%var and 1.var are not.

Variable assignment

Assignment can be made left or right:

x.1 <- 5
x.1 + 3 -> .x
print(.x)

## [1] 8

Useful functions for variables

# List all variables
ls()                   # a hidden variable starts with a dot
ls(all.names = TRUE)   # shows all variables

# Remove a variable
rm(x.1)           # removes the variable x.1
ls()              # x.1 is not there any more
rm(list = ls())   # removes all variables
ls()


2.2 Data Types

2.2.1 Basic Data Types

Basic Data Types

Variables are not declared as a data type; rather, they are assigned a class object:

x <- TRUE
class(x)

## [1] "logical"

x <- 5L
class(x)

## [1] "integer"

x <- 5.135
class(x)

## [1] "numeric"

x <- 2.2 + 3.2i
class(x)

## [1] "complex"

x <- "test"
class(x)

## [1] "character"

2.2.2 Vectors

Composed Data Types: Vectors vs Lists

Vectors are collections of objects of the same type; they are declared with the function c():


x <- c(2, 2.5, 4, 6)
y <- c("apple", "pear")
class(x)

## [1] "numeric"

class(y)

## [1] "character"

A collection of objects of different types is called a "list":

# Create a list.
list1 <- list(c(1,2,3), 3.1415, sin)

# Print the list.
print(list1)

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3.1415
##
## [[3]]
## function (x) .Primitive("sin")

More about lists can be found in Section 2.2.3.

Accessing Elements of a Vector

v <- c(1:5)
# access elements via indexing
v[2]

## [1] 2

v[c(1,5)]

## [1] 1 5

# via TRUE/FALSE:
v[c(TRUE,TRUE,FALSE,FALSE,TRUE)]

## [1] 1 2 5

# leave out certain elements:
v[c(-2,-3)]

## [1] 1 4 5


Vector arithmetic

All standard operations work element by element:

v1 <- c(1,2,3)
v2 <- c(4,5,6)
# Standard arithmetic
v1 + v2

## [1] 5 7 9

v1 - v2

## [1] -3 -3 -3

v1 * v2

## [1] 4 10 18

v1 / v2

## [1] 0.25 0.40 0.50

# Vector Recycling
v1 <- c(1,2,3,4)
v2 <- c(1,2)
v1 + v2

## [1] 2 4 4 6

# because v2 became (1,2,1,2)

Vector sorting

v1 <- c(1, -4, 2, 0, pi)
sort(v1)

## [1] -4.000000 0.000000 1.000000 2.000000 3.141593

v2 <- c("January", "February", "March", "April")sort(v2)

## [1] "April" "February" "January" "March"

sort(v2, decreasing=TRUE)

## [1] "March" "January" "February" "April"


Exercise: S&P500Question

The time series nottem (from the package “datasets” that is usuallyloaded when R starts) contains the temperatures in Notthinghamfrom 1920 to 1939 in Farenheit. Create a new object that containsa list of all temperatures in Celcius.

Note that nottem is a time-series object (see: Chapter 2.7 on page 77)and not a matrix. All its elements are addressed with nottam[n]where

n is between 1 and length(nottam).Remember that T (C) = 5

9(T (F )− 32).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

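A minimal sketch of one possible solution (the object name is a free choice; arithmetic on a ts object works element by element):

# sketch: convert the whole series at once
nottem.celsius <- (nottem - 32) * 5/9
head(nottem.celsius)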

2.2.3 Lists

Lists: definition

Definition 1 .:. List .:.

In R, lists are objects which contain elements of different types (they can mix numbers, strings, vectors, matrices, functions, boolean variables and even lists).

# A list is created using the list() function.
myList <- list("Approximation", pi, 3.14, c)
print(myList)

## [[1]]
## [1] "Approximation"
##
## [[2]]
## [1] 3.141593
##
## [[3]]
## [1] 3.14
##
## [[4]]
## function (...) .Primitive("c")


Lists in R behave very much like objects in object-oriented (OO) programming. Indeed, they allow us to structure the code around the data, just as in OO programming.

Naming elements of lists

# create the list
L <- list("Approximation", pi, 3.14, c)

# assign names to the elements
names(L) <- c("description", "exact", "approx", "function")
print(L)

## $description
## [1] "Approximation"
##
## $exact
## [1] 3.141593
##
## $approx
## [1] 3.14
##
## $`function`
## function (...) .Primitive("c")

# addressing elements of the named list
print(paste("The difference is", L$exact - L$approx))

## [1] "The difference is 0.00159265358979299"

print(L[3])

## $approx
## [1] 3.14

print(L$approx)

## [1] 3.14

# however "function" was a reserved worda <- L$`function`(2,3,pi,5) # to access the function c(...)print(a)

## [1] 2.000000 3.000000 3.141593 5.000000

Merged lists are also lists


V1 <- c(1,2,3)
L2 <- list(V1, c(2:7))
L3 <- list(L2, V1)
print(L3)

## [[1]]
## [[1]][[1]]
## [1] 1 2 3
##
## [[1]][[2]]
## [1] 2 3 4 5 6 7
##
##
## [[2]]
## [1] 1 2 3

print(L3[[1]][[2]][3])

## [1] 4

Add and delete list elements

L <- list("mystring", matrix(c(1,2,3,4),nrow=2))

#add an elementL <- list(L, c(1:10))

#delete an elementL[1] <- NULLprint(L[1])

## [[1]]## [1] 1 2 3 4 5 6 7 8 9 10

print(L[2])

## [[1]]## NULL

Convert list to vectors

L <- list(c(1:5), c(6:10))
v1 <- unlist(L[1])
v2 <- unlist(L[2])
v2 - v1

## [1] 5 5 5 5 5


The reason to convert lists back to vectors is performance, or that some functions expect vectors and will not work on lists.

2.2.4 Matrices

Matrices

A matrix is a two-dimensional data set. The matrix() function offers a convenient way to define it:

# Create a matrix.
M = matrix(c(1:6), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6

M = matrix(c(1:6), nrow = 2, ncol = 3, byrow = FALSE)
print(M)

## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6

Naming rows and columns

rownames = c("row1", "row2", "row3", "row4")colnames = c("col1", "col2", "col3")M <- matrix(c(10:21), nrow = 4, byrow = TRUE,

dimnames = list(rownames, colnames))print(M)

## col1 col2 col3## row1 10 11 12## row2 13 14 15## row3 16 17 18## row4 19 20 21

Accessing data in a Matrix

M <- matrix(c(10:21), nrow = 4, byrow = TRUE)

# access one element
M[1,2]


## [1] 11

# second row
M[2,]

## [1] 13 14 15

# second column
M[,2]

## [1] 11 14 17 20

Matrix Arithmetic

Matrix arithmetic works element by element:

M1 <- matrix(c(10:21), nrow = 4, byrow = TRUE)
M2 <- matrix(c(0:11), nrow = 4, byrow = TRUE)
M1 + M2

## [,1] [,2] [,3]
## [1,] 10 12 14
## [2,] 16 18 20
## [3,] 22 24 26
## [4,] 28 30 32

M1 * M2

## [,1] [,2] [,3]
## [1,] 0 11 24
## [2,] 39 56 75
## [3,] 96 119 144
## [4,] 171 200 231

M1 / M2

## [,1] [,2] [,3]
## [1,] Inf 11.000000 6.000000
## [2,] 4.333333 3.500000 3.000000
## [3,] 2.666667 2.428571 2.250000
## [4,] 2.111111 2.000000 1.909091

Question

Write a function for the dot-product of matrices. Also add some sanity checks. Finally, compare your results with the %*% operator (or the function crossprod()).


# Example of the results that you should find:
a <- c(1:3)
a %*% a

## [,1]
## [1,] 14

a %*% t(a)

## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 4 6
## [3,] 3 6 9

t(a) %*% a

## [,1]
## [1,] 14

A <- matrix(1:9, nrow = 3)
A %*% a

## [,1]
## [1,] 30
## [2,] 36
## [3,] 42

A %*% t(a) # this is bound to fail!

## Error in A %*% t(a): non-conformable arguments

A %*% A

## [,1] [,2] [,3]
## [1,] 30 66 102
## [2,] 36 81 126
## [3,] 42 96 150

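A minimal sketch of one possible solution; the function name mat.mult and the specific checks are a free choice, with the built-in %*% operator as the reference:

# sketch: naive matrix product with a conformability check
mat.mult <- function(A, B) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  if (ncol(A) != nrow(B)) stop("non-conformable arguments")
  C <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      C[i, j] <- sum(A[i, ] * B[, j])
  C
}

# compare with the built-in operator
A <- matrix(1:9, nrow = 3)
all.equal(mat.mult(A, A), A %*% A)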


2.2.5 Arrays

Creating and accessing arrays

Arrays can be of any number of dimensions (matrices are always 2-dimensional), but need to be of one data type. They can be created with the array() function; this function takes a "dim" attribute which defines the number of dimensions. While arrays are similar to lists, they have to be of one class type (lists can consist of different class types).

In the example below we create an array with two elements which are 3x3 matrices each.

# Create an array.
a <- array(c('A','B'), dim = c(3,3,2))
print(a)

# access one element
a[2,2,2]

# access one layer
a[,,2]

Naming array elements

# Create two vectors
v1 <- c(1,1)
v2 <- c(10:13)
row.names <- c("col1","col2")
col.names <- c("R1","R2","R3")
matrix.names <- c("Matrix1","Matrix2")

# Take these vectors as input to the array.
a <- array(c(v1,v2), dim = c(2,3,2),
           dimnames = list(row.names, col.names, matrix.names))

print(a)

Manipulating arrays

M1 <- a[,,1]
M2 <- a[,,2]
M2

## R1 R2 R3
## col1 1 10 12
## col2 1 11 13


Applying functions over arrays

An efficient way to apply the same function over an array is the function apply().

Definition 2

apply(X, MARGIN, FUN, ...) with:

1. X: an array, including a matrix.

2. MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

3. FUN: the function to be applied (see the Details section of ?apply). In the case of functions like +, the name must be backquoted or quoted.

An example for apply()

x <- cbind(x1 = 3, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean, trim = .2)

## x1 x2
## 3 3

col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
rbind(cbind(x, Rtot = row.sums),
      Ctot = c(col.sums, sum(col.sums)))

## x1 x2 Rtot
## a 3 4 7
## b 3 3 6
## c 3 2 5
## d 3 1 4
## e 3 2 5
## f 3 3 6
## g 3 4 7
## h 3 5 8
## Ctot 24 24 48


2.2.6 Factors

Factors

Factors are the R objects which hold a series of labels. They store the vector along with the distinct values of the elements in the vector as labels. The labels are always characters, irrespective of the data type of the elements in the input vector. Factors are useful in statistical modelling.

Factors are created using the factor() function.

# Create a vector containing all your observations.
feedback <- c('good','good','bad','average','bad','good')

# Create a factor object.
factor_feedback <- factor(feedback)

# Print the factor object.
print(factor_feedback)

## [1] good good bad average bad good
## Levels: average bad good

# Plot the histogram -- note the default order is alphabetic
plot(factor_feedback)

# The nlevels function returns the number of levels.
print(nlevels(factor_feedback))

## [1] 3

Ordering the factors

feedback <- c('good','good','bad','average','bad','good')
factor_feedback <- factor(feedback,
                          levels = c("bad","average","good"))
plot(factor_feedback)


[Figure 2.1: Factors. Bar plot of factor_feedback; the default level order is alphabetic: average, bad, good.]

[Figure 2.2: Ordered factors. Bar plot of factor_feedback with the levels ordered as bad, average, good.]


The function gl() can generate factors

Definition 3 .:. gl() .:.

gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

with

• n: the number of levels

• k: the number of replications (for each level)

• length (optional): an integer giving the length of the result

• labels (optional): a vector with the labels

• ordered: a boolean variable indicating whether the resultsshould be ordered

gl(3,2,,c("bad","average","good"),TRUE)

## [1] bad bad average average good good
## Levels: bad < average < good

Exercise: cars

Question

Use the dataset mtcars (from the package datasets, which is normally loaded when R starts) and explore the distribution of the number of gears. Then focus on the transmission and create a factor object with the words "automatic" and "manual" instead of the numbers 0 and 1.

Use ?mtcars to find out the exact definition of the data.

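A minimal sketch of one possible solution (mtcars ships with base R's datasets package; the object names are a free choice):

# distribution of the number of gears
plot(factor(mtcars$gear))

# transmission: am is 0 = automatic, 1 = manual (see ?mtcars)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
summary(transmission)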

Exercise: cars-horsepower


Question

Use the dataset mtcars (from the package datasets) and explore the distribution of the horsepower (hp). How would you proceed to make a factoring (e.g. Low, Medium, High) for this attribute? Hint: use the function cut().

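A minimal sketch using cut(); splitting at the terciles is just one of many defensible choices of breakpoints:

hp.cat <- cut(mtcars$hp,
              breaks = quantile(mtcars$hp, probs = c(0, 1/3, 2/3, 1)),
              labels = c("Low", "Medium", "High"),
              include.lowest = TRUE)
table(hp.cat)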

2.2.7 Data Frames

Data Frames

Data frames are very useful for statistical modelling; they are objects that contain data in a tabular way. Unlike a matrix, in a data frame each column can contain a different mode of data. For example, the first column can be factorial, the second logical and the third numerical.

A data frame is a composed data type consisting of a list of vectors of equal length. Data frames are created using the data.frame() function.

# Create the data frame.
test.data <- data.frame(
  name = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
  gender = c("Male", "Male", "Female", "Female", "Female"),
  score = c(78, 88, 92, 89, 84),
  age = c(42, 38, 26, 30, 35)
)
print(test.data)

## name gender score age
## 1 Piotr Male 78 42
## 2 Pawel Male 88 38
## 3 Paula Female 92 26
## 4 Lisa Female 89 30
## 5 Laura Female 84 35

# the standard plot function on a data frame ...
plot(test.data)


[Figure 2.3: The standard plot for a data frame in R: a scatter-plot matrix of the variables name, gender, score and age.]

# ... coincides with the pairs() function
# pairs(test.data)

Edit data in data-frames

de(x)                 # fails if x is not defined
de(x <- c(NA))        # works
x <- de(x <- c(NA))   # will also save the changes
data.entry(x)         # de is short for data.entry
x <- edit(x)          # use the standard editor (vi in *nix)

Get information about data frames

# Get the structure of the data frame:
str(test.data)

## 'data.frame': 5 obs. of 4 variables:
## $ name : Factor w/ 5 levels "Laura","Lisa",..: 5 4 3 2 1
## $ gender: Factor w/ 2 levels "Female","Male": 2 2 1 1 1
## $ score : num 78 88 92 89 84
## $ age : num 42 38 26 30 35


# Get the summary of the data frame:
summary(test.data)

## name gender score age
## Laura:1 Female:3 Min. :78.0 Min. :26.0
## Lisa :1 Male :2 1st Qu.:84.0 1st Qu.:30.0
## Paula:1 Median :88.0 Median :35.0
## Pawel:1 Mean :86.2 Mean :34.2
## Piotr:1 3rd Qu.:89.0 3rd Qu.:38.0
## Max. :92.0 Max. :42.0

# Get the first rows:
head(test.data)

## name gender score age
## 1 Piotr Male 78 42
## 2 Pawel Male 88 38
## 3 Paula Female 92 26
## 4 Lisa Female 89 30
## 5 Laura Female 84 35

# Get the last rows:
tail(test.data)

## name gender score age
## 1 Piotr Male 78 42
## 2 Pawel Male 88 38
## 3 Paula Female 92 26
## 4 Lisa Female 89 30
## 5 Laura Female 84 35

# Extract columns 2 and 4, keeping all rows
test.data.1 <- test.data[, c(2,4)]
print(test.data.1)

## gender age
## 1 Male 42
## 2 Male 38
## 3 Female 26
## 4 Female 30
## 5 Female 35

Add columns to a data-frame

# Expand the data frame: simply define the additional column
test.data$end_date <- as.Date(c("2014-03-01", "2017-02-13",
                                "2014-10-10", "2015-05-10", "2010-08-25"))
print(test.data)


## name gender score age end_date
## 1 Piotr Male 78 42 2014-03-01
## 2 Pawel Male 88 38 2017-02-13
## 3 Paula Female 92 26 2014-10-10
## 4 Lisa Female 89 30 2015-05-10
## 5 Laura Female 84 35 2010-08-25

# Or use the function cbind()
a <- c(1:5)
b <- c(11:15)
c <- c(111:115)
df <- cbind(a, b, c)
print(df)

## a b c
## [1,] 1 11 111
## [2,] 2 12 112
## [3,] 3 13 113
## [4,] 4 14 114
## [5,] 5 15 115

Add new data (rows) to a data frame

# To add a row, we need the rbind() function
test.data.to.add <- data.frame(
  name = c("Ricardo", "Anna"),
  gender = c("Male", "Female"),
  score = c(66, 80),
  age = c(70, 36),
  end_date = as.Date(c("2016-05-05", "2016-07-07")))
test.data.new <- rbind(test.data, test.data.to.add)
print(test.data.new)

## name gender score age end_date
## 1 Piotr Male 78 42 2014-03-01
## 2 Pawel Male 88 38 2017-02-13
## 3 Paula Female 92 26 2014-10-10
## 4 Lisa Female 89 30 2015-05-10
## 5 Laura Female 84 35 2010-08-25
## 6 Ricardo Male 66 70 2016-05-05
## 7 Anna Female 80 36 2016-07-07

Merging data frames

Merging allows us to extract the subset of two data frames where a given set of columns match:


test.data.1 <- data.frame(
  name = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
  gender = c("Male", "Male", "Female", "Female", "Female"),
  score = c(78, 88, 92, 89, 84),
  age = c(42, 38, 26, 30, 35)
)
test.data.2 <- data.frame(
  name = c("Piotr", "Pawel", "notPaula", "notLisa", "Laura"),
  gender = c("Male", "Male", "Female", "Female", "Female"),
  score = c(78, 88, 92, 89, 84),
  age = c(42, 38, 26, 30, 135)
)
test.data.merged <- merge(x = test.data.1, y = test.data.2,
                          by.x = c("name","age"), by.y = c("name","age"))
print(test.data.merged)

## name age gender.x score.x gender.y score.y
## 1 Pawel 38 Male 88 Male 88
## 2 Piotr 42 Male 78 Male 78

Shortcuts

R will allow shortcuts, provided they are unique:

test.data$n

## [1] Piotr Pawel Paula Lisa Laura
## Levels: Laura Lisa Paula Pawel Piotr

Exercise: data frames

Question

1. create a 3 by 3 matrix with the numbers 1 to 9,

2. convert it to a data-frame,

3. add names for the columns and rows,

4. add a column with the column-totals

5. drop the second column

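A minimal sketch of one possible solution (step 4 is read here as adding a column with the row totals; all names are a free choice):

M <- matrix(1:9, nrow = 3)                  # step 1
df <- as.data.frame(M)                      # step 2
colnames(df) <- c("col1", "col2", "col3")   # step 3
rownames(df) <- c("row1", "row2", "row3")
df$total <- rowSums(df)                     # step 4
df <- df[, -2]                              # step 5: drop the second column
print(df)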


2.3 Operators

2.3.1 Arithmetic Operators

Arithmetic operators act on each element of an object

v1 <- c(2,4,6,8)
v2 <- c(1,2,3,5)
v1 + v2 # addition

## [1] 3 6 9 13

v1 - v2 # subtraction

## [1] 1 2 3 3

v1 * v2 # multiplication

## [1] 2 8 18 40

v1 / v2 # division

## [1] 2.0 2.0 2.0 1.6

v1 %% v2 # remainder of division

## [1] 0 0 0 3

v1 %/% v2 # integer division (floor of v1/v2)

## [1] 2 2 2 1

v1 ^ v2 # v1 to the power of v2

## [1] 2 16 216 32768

2.3.2 Relational Operators

Relational operators compare vectors element by element:

v1 <- c(8,6,3,2)
v2 <- c(1,2,3,5)
v1 > v2 # bigger than


## [1] TRUE TRUE FALSE FALSE

v1 < v2 # smaller than

## [1] FALSE FALSE FALSE TRUE

v1 <= v2 # smaller or equal

## [1] FALSE FALSE TRUE TRUE

v1 >= v2 # bigger or equal

## [1] TRUE TRUE TRUE FALSE

v1 == v2 # equal

## [1] FALSE FALSE TRUE FALSE

v1 != v2 # not equal

## [1] TRUE TRUE FALSE TRUE

2.3.3 Logical Operators

Logical operators combine vectors element by element. They are only possible on numeric, logical or complex types.

v1 <- c(TRUE, FALSE, TRUE, FALSE, 8, 6+3i, -2, 0, NA)
v2 <- c(FALSE, TRUE, TRUE, FALSE, 8, 6, -1, TRUE, TRUE)
v1 & v2 # and

## [1] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE NA

v1 | v2 # or

## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE

!v1 # not

## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE NA

v1 && v2 # and applied to the first element

## [1] FALSE

v1 || v2 # or applied to the first element

## [1] TRUE


Note that numbers different from zero are considered as TRUE, and zero as FALSE.
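For example, coercing numbers to logical makes this explicit:

as.logical(c(-2, 0, 8)) # non-zero becomes TRUE, zero becomes FALSE

## [1] TRUE FALSE TRUE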

2.3.4 Assignment Operators

Assignment operators are left or right:

# left assignment
x <- 3
x = 3
x <<- 3

# right assignment
3 -> x
3 ->> x

# chained assignment
x <- y <- 4

2.3.5 Other Operators

Various operators

# : creates a sequence
x <- c(10:20)
x

## [1] 10 11 12 13 14 15 16 17 18 19 20

# %in% can find an element in a vector
2 %in% x # FALSE, since 2 is not an element of x

## [1] FALSE

11 %in% x # TRUE, since 11 is in x

## [1] TRUE

x[x %in% c(12,13)] # selects elements from x

## [1] 12 13

x[c(2:4)] # selects the elements with index between 2 and 4

## [1] 11 12 13


# %*% is the matrix multiplication (or cross-product) operator
M = matrix(c(1,2,3,7,8,9,4,5,6), nrow = 3, ncol = 3,
           byrow = TRUE)
M %*% t(M)

## [,1] [,2] [,3]
## [1,] 14 50 32
## [2,] 50 194 122
## [3,] 32 122 77

M %*% M

## [,1] [,2] [,3]
## [1,] 27 33 39
## [2,] 99 123 147
## [3,] 63 78 93

exp(M)

## [,1] [,2] [,3]
## [1,] 2.718282 7.389056 20.08554
## [2,] 1096.633158 2980.957987 8103.08393
## [3,] 54.598150 148.413159 403.42879

2.3.6 Loops

For

for (value in vector) {
  statements
}

This will execute the statements once for each value in the given vector, e.g.:

x <- LETTERS[1:5]
for (j in x) {
  print(j)
}

## [1] "A"
## [1] "B"
## [1] "C"
## [1] "D"
## [1] "E"


Repeat

repeat {
  commands
  if (condition) { break }
}

example:

x <- c(1,2)
c <- 2
repeat {
  print(x + c)
  c <- c + 1
  if (c > 4) { break }
}

## [1] 3 4
## [1] 4 5
## [1] 5 6

While

while (test_expression) {
  statements
}

The statements are executed as long as the test expression is true.

x <- c(1,2); c <- 2
while (c < 4) {
  print(x + c)
  c <- c + 1
}

## [1] 3 4
## [1] 4 5

Loop control statements: break

The break statement in the R programming language has the following two usages:

• When the break statement is encountered inside a loop, the loop is immediately terminated and program control resumes at the next statement following the loop.


• It can be used to terminate a case in the switch statement (covered in the next chapter).

v <- c(1:5)
for (j in v) {
  if (j == 3) {
    print("--break--")
    break
  }
  print(j)
}

## [1] 1
## [1] 2
## [1] "--break--"

Loop control statements: next

The next statement skips the remainder of the current iteration of a loop and starts the next iteration of the loop.

v <- c(1:5)
for (j in v) {
  if (j == 3) {
    print("--skip--")
    next
  }
  print(j)
}

## [1] 1
## [1] 2
## [1] "--skip--"
## [1] 4
## [1] 5

2.3.7 Functions

Built-in functions

Standard functionality. Examples:

• demo(): shows some of the capabilities of R


• q(): quits R

• data(): shows the data-sets available

• help(): shows help

• ls(): shows variables

• c(): creates a vector

• seq(): creates a sequence

• mean(): calculates the mean

• max(): returns the maximum

• sum(): returns the sum

• paste(): concatenates vector elements

Help with functions

help(c)   # shows help with the function c
?c        # same result

apropos("cov")   # fuzzy search for functions

Creating a function

User-defined functions:

function_name <- function(arg_1, arg_2, ...) {
  function_body
  return_value
}

Example 1

c.surface <- function(radius) {
  x <- radius ^ 2 * pi
  return(x)
}

c.surface(2) + 2

## [1] 14.56637

Note that it is not necessary to explicitly "return" something: a function automatically returns the value of the last expression that it evaluates. So, the following fragment does exactly the same:


c.surface <- function(radius) {
  radius ^ 2 * pi
}

c.surface(2) + 2

## [1] 14.56637

Editing functions in R

Most probably you will work in a modern environment such as the IDE RStudio. In that case it probably makes sense to keep the functions in a separate file that is loaded into your code with the command source(). However, there might be cases where one has only terminal access to R; then the following functions come in handy. On a Linux server the default editor is most probably vi, which is not so popular any more these days. To get out of it: press [esc], then type ":q" and press [enter].

# edit the function with vi
fix(c.surface)

# or
c.surface <- edit()

Function with a default argument

Assigning a default value to the argument of a function means that this argument will get the default value unless another value is supplied; in other words: if nothing is supplied, then the default is used.

It is quite handy to have the possibility to assign a default value to an argument. It saves a lot of typing work and makes code more readable, and it also allows adding an argument to an existing function while remaining compatible with all previous code in which that argument was never supplied.

Example 2

The function paste() collates the arguments provided and returns the string containing them all. What is the default separator used in paste()?
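The answer can be verified directly; the default separator of paste() is a single space:

paste("Hello", "world")   # default sep = " "

## [1] "Hello world"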

Creating functions with a default value

c.surface <- function(radius = 2) {
  radius ^ 2 * pi
}
c.surface(1)

## [1] 3.141593

c.surface()

## [1] 12.56637

Exercise: use functions

Question

Calculate the mean of all numbers from 1000 to 1500.

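One possible solution:

mean(c(1000:1500))

## [1] 1250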

Exercise: create functions

Question

Create a function that takes a matrix as input, calculates the sum of each row, and returns these sums as a vector.

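A minimal sketch (the function name is a free choice); apply() from Section 2.2.5 does the work:

row.totals <- function(M) {
  apply(M, 1, sum)   # equivalently: rowSums(M)
}
row.totals(matrix(1:9, nrow = 3))

## [1] 12 15 18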

2.3.8 Packages

The package system

Additional functions come in "packages". To use them one needs the following:

# download the package (only once):
install.packages('DiagrammeR')

# then load it in each session where it is used:
library(DiagrammeR)

Examples of packages

• To load data

– RODBC, RMySQL, RPostgresSQL, RSQLite: read data from a database

– XLConnect, xlsx: read and write Microsoft Excel files (of course you can also just export your spreadsheets from Excel as CSV files)

– foreign: to use e.g. SAS data

– Note: among R's standard functionality is handling text files. Use the function read.table() or its more specific siblings such as read.csv() to read a CSV file and read.fwf() to read a fixed-width table.

• To manipulate data

– dplyr: creating subsets, summarizing, rearranging, and joining data sets

– tidyr: changing the layout of your data sets

– stringr: tools for regular expressions and character strings

– lubridate: tools to facilitate working with dates and times

– reshape: tools to present data differently (melt and cast)

• To visualize data

– ggplot2: allows professional graphics (and has many extensions)

– ggvis: to build interactive, web-based graphics

– rgl: interactive 3D visualizations with R

– htmlwidgets: build interactive (JavaScript-based) visualizations. Packages that implement htmlwidgets include: leaflet (maps), dygraphs (time series), DT (tables), diagrammeR (diagrams), network3D (network graphs), threeJS (3D scatterplots and globes).

– googleVis: use Google Chart tools to visualize data in R.


• To model data

– car: car's Anova function is popular for making type II and type III Anova tables

– mgcv: Generalized Additive Models

– lme4/nlme: Linear and Non-linear mixed effects models

– randomForest: random forest methods from machine learning

– multcomp: tools for multiple comparison testing

– vcd: visualization tools and tests for categorical data

– glmnet: lasso and elastic-net regression methods with cross-validation

– survival: tools for survival analysis

– caret: tools for training regression and classification models

• To report results

– shiny: make interactive web apps (e.g. to explore data and share findings with non-programmers)

– R Markdown: write R code in a markdown report (when render is run, R Markdown replaces the code with its results and then exports your report as an HTML, pdf or MS Word document, or as an HTML or pdf slideshow); hence it allows automated reporting. R Markdown is integrated into RStudio.

– knitr: the same tool but for use in LaTeX (and it can be used for other markup languages)

– xtable: converts R objects (such as data frames) and returns the LaTeX or HTML code

• For Spatial data

– sp, maptools: tools for loading and using spatial data, including shapefiles

– maps: use map polygons for plots

– ggmap: use street maps from Google Maps as a background in ggplots

• For Time Series and Financial data

– zoo: provides a format for saving time series objects

– xts: tools for manipulating time series data sets

– quantmod: tools for downloading financial data, plotting common charts, and doing technical analysis


• To write high performance R code

– Rcpp: use C++ code from within R functions for speed

– data.table: an alternative way to organize data sets for faster operations; useful for big data

– parallel: parallel processing in R

• To work with the web

– XML: read and create XML documents with R

– jsonlite: read and create JSON data tables with R

– httr: tools for working with http connections

• To write your own R packages

– devtools: tools for turning your code into an R package

– testthat: provides an easy way to write tests for your code

– roxygen2: (like Doxygen for C++) turns inline code comments into documentation pages and builds a package namespace.

Useful functions for libraries

# See the path where libraries are stored
.libPaths()

# See the list of installed packages
library()

# See the list of currently loaded packages
search()

2.3.9 Strings

Simple rules

• strings must start and end with single or double quotes

• a string ends when the same quotes are encountered the next time

• until then it can contain the other type of quotes


a <- "Hello"b <- "world"paste(a,b,sep=", ")

## [1] "Hello, world"

c <- "A 'valid' string"

Formatting with format()

format(x, trim = FALSE, digits = NULL, nsmall = 0L,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "", big.interval = 3L,
       small.mark = "", small.interval = 5L,
       decimal.mark = getOption("OutDec"),
       zero.print = NULL, drop0trailing = FALSE, ...)

• x is the vector input.

• digits is the total number of digits displayed.

• nsmall is the minimum number of digits to the right of the decimal point.

• scientific is set to TRUE to display scientific notation.

• width is the minimum width to be displayed, obtained by padding blanks in the beginning.

• justify is the display of the string to left, right or center.

Formatting examples

a<-format(100000000,big.mark=" ",nsmall=3,width=20,scientific=FALSE,justify="r")

print(a)

## [1] " 100 000 000.000"

More information? ?format or help(format)


Other string functions

• nchar(): returns the number of characters in a string

• toupper(): puts the string in uppercase

• tolower(): puts the string in lowercase

• substring(x, first, last): returns a substring from x starting with the “first” and ending with the “last”

• strsplit(x, split): splits the elements of a vector into substrings according to the matches of a substring “split”.

There is also a family of search functions: grep(), grepl(), regexpr(), gregexpr(), and regexec() that supply powerful search and replace capabilities.

sub() will replace the first of all matches and gsub() will replace all matches.
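To make these functions concrete, here is a small illustrative session (the string is made up for the example):

s <- "Hello, world"
nchar(s)              # 12
toupper(s)            # "HELLO, WORLD"
substring(s, 1, 5)    # "Hello"
strsplit(s, ", ")     # a list containing "Hello" and "world"
sub("l", "L", s)      # "HeLlo, world"  (only the first match)
gsub("l", "L", s)     # "HeLLo, worLd"  (all matches)
grepl("world", s)     # TRUE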


2.4 Selected Data Interfaces

Reading text into a variable can be done by t <- readLines(file.choose()) or by providing the file name directly: t <- readLines("R.book.txt"). Typically, however, that is not what we need: in order to manipulate data and numbers it will be necessary to load the data into a vector or data frame, for example.

2.4.1 CSV Files

Import a CSV file

Download the CSV file with currency exchange rates vs the euro from
http://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html or directly
http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip?c6f8f9a0a5f970e31538be5271051b3c

# To read a CSV-file it needs to be in the current directory
getwd()                            # show actual working directory
setwd("/home/philippe/Downloads")  # change working directory
data <- read.csv("eurofxref-hist.csv")
is.data.frame(data)
ncol(data)
nrow(data)
head(data)
hist(data$CAD)

plot(data$USD, data$CAD)

Finding data

# get the maximum exchange rate
maxCAD <- max(data$CAD)
# use SQL-like selection
d0 <- subset(data, CAD == maxCAD)
d1 <- subset(data, CAD > maxCAD - 0.1)
d1[,1]

##  [1] 2008-12-30 2008-12-29 2008-12-18 1999-02-03
##  [5] 1999-01-29 1999-01-28 1999-01-27 1999-01-26
##  [9] 1999-01-25 1999-01-22 1999-01-21 1999-01-20
## [13] 1999-01-19 1999-01-18 1999-01-15 1999-01-14


Figure 2.4: The histogram of the CAD.

Figure 2.5: A scatter-plot of one variable with another (here data$CAD plotted against data$USD).


## [17] 1999-01-13 1999-01-12 1999-01-11 1999-01-08
## [21] 1999-01-07 1999-01-06 1999-01-05 1999-01-04
## 4718 Levels: 1999-01-04 1999-01-05 ... 2017-06-05

d2<- data.frame(d1$Date,d1$CAD)d2

##       d1.Date d1.CAD
## 1  2008-12-30 1.7331
## 2  2008-12-29 1.7408
## 3  2008-12-18 1.7433
## 4  1999-02-03 1.7151
## 5  1999-01-29 1.7260
## 6  1999-01-28 1.7374
## 7  1999-01-27 1.7526
## 8  1999-01-26 1.7609
## 9  1999-01-25 1.7620
## 10 1999-01-22 1.7515
## 11 1999-01-21 1.7529
## 12 1999-01-20 1.7626
## 13 1999-01-19 1.7739
## 14 1999-01-18 1.7717
## 15 1999-01-15 1.7797
## 16 1999-01-14 1.7707
## 17 1999-01-13 1.8123
## 18 1999-01-12 1.7392
## 19 1999-01-11 1.7463
## 20 1999-01-08 1.7643
## 21 1999-01-07 1.7602
## 22 1999-01-06 1.7711
## 23 1999-01-05 1.7965
## 24 1999-01-04 1.8004

hist(d2$d1.CAD)

Working with dates

# Get the recent peaks
recent <- subset(d2, as.Date(d1.Date) > as.Date("2005-01-01"))
print(recent)

##      d1.Date d1.CAD
## 1 2008-12-30 1.7331
## 2 2008-12-29 1.7408
## 3 2008-12-18 1.7433


Figure 2.6: The histogram of the most recent values of the CAD only.

Writing to a CSV file

write.csv(d2,"output.csv", row.names = FALSE)new.d2 <- read.csv("output.csv")print(new.d2)

##       d1.Date d1.CAD
## 1  2008-12-30 1.7331
## 2  2008-12-29 1.7408
## 3  2008-12-18 1.7433
## 4  1999-02-03 1.7151
## 5  1999-01-29 1.7260
## 6  1999-01-28 1.7374
## 7  1999-01-27 1.7526
## 8  1999-01-26 1.7609
## 9  1999-01-25 1.7620
## 10 1999-01-22 1.7515
## 11 1999-01-21 1.7529
## 12 1999-01-20 1.7626
## 13 1999-01-19 1.7739
## 14 1999-01-18 1.7717
## 15 1999-01-15 1.7797
## 16 1999-01-14 1.7707
## 17 1999-01-13 1.8123
## 18 1999-01-12 1.7392
## 19 1999-01-11 1.7463


## 20 1999-01-08 1.7643
## 21 1999-01-07 1.7602
## 22 1999-01-06 1.7711
## 23 1999-01-05 1.7965
## 24 1999-01-04 1.8004

Note: without the row.names = FALSE argument this procedure would also write the row names, which re-appear as a column “X” when the file is read back in.
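A minimal sketch of that behaviour (the file name output2.csv is just an example):

write.csv(d2, "output2.csv")      # row.names defaults to TRUE
head(read.csv("output2.csv"), 2)  # the row names come back as a column "X"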

2.4.2 Excel Files

Excel Files are similar to CSV files

# install the package xlsx if not yet done
if (!any(grepl("xlsx", installed.packages()))) {
  install.packages("xlsx")
}
library(xlsx)
data <- read.xlsx("input.xlsx", sheetIndex = 1)

2.4.3 Databases

Databases

R can connect to many popular database systems. For example, MySQL:

As usual there is a package that provides this functionality:

if(!any(grepl("xls", installed.packages()))){install.packages("RMySQL")}

library(RMySQL)

Connecting to the database

# the connection is stored in an R object myConnection
# it needs the database name (db_name), username and password
myConnection <- dbConnect(MySQL(),
                          user     = 'root',
                          password = 'xxx',
                          dbname   = 'db_name',
                          host     = 'localhost')

# e.g. list the tables available in this database.
dbListTables(myConnection)


Fetching data from a database

# Prepare the query for the database
result <- dbSendQuery(myConnection,
                      "SELECT * from tbl_students WHERE age > 33")

# Fetch all the records (with n = -1) and store them in a data frame.
data <- fetch(result, n = -1)

Update Queries

The dbSendQuery() function can be used to send any query, including UPDATE, INSERT, CREATE TABLE and DROP TABLE queries, so we can push results back to the database.

sSQL = ""sSQL[1] <- "UPDATE tbl_students

SET score = 'A' WHERE raw_score > 90;"sSQL[2] <- "INSERT INTO tbl_students

(name, class, score, raw_score)VALUES ('Robert', 'Grade 0', 88,NULL);"

sSQL[3] <- "DROP TABLE IF EXISTS tbl_students;"for (k in c(1:3)){dbSendQuery(myConnection, sSQL[k])}

Create tables from R data frames

R can write the value of a data frame into a table:

dbWriteTable(myConnection, "tbl_name",
             data_frame_name[, ], overwrite = TRUE)

Finally close the connection with dbDisconnect(myConnection, ...).
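Putting the pieces together, a typical session could look as follows (a sketch; the credentials and table names are the placeholders from the examples above):

library(RMySQL)
con <- dbConnect(MySQL(), user = 'root', password = 'xxx',
                 dbname = 'db_name', host = 'localhost')
res <- dbSendQuery(con, "SELECT * from tbl_students WHERE age > 33")
df  <- fetch(res, n = -1)   # fetch all records into a data frame
dbClearResult(res)          # free the result set
dbDisconnect(con)           # finally, close the connection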


2.5 Charts & Graphs

2.5.1 Pie Charts

Pie charts

Definition 4 .:. pie() .:.

pie(x, labels = names(x), edges = 200, radius = 0.8,
    clockwise = FALSE,
    init.angle = if (clockwise) 90 else 0,
    density = NULL, angle = 45, col = NULL,
    border = NULL, lty = NULL, main = NULL, ...)

where the most important parameters are

• x: a vector of non-negative numerical quantities. The values in x are displayed as the areas of the pie slices

• labels: strings with names for the slices

• radius: the radius of the circle of the chart (a value between −1 and +1)

• main: indicates the title of the chart

• col: the colour palette

• clockwise: a logical value indicating if the slices are drawn clockwise or anti-clockwise

Pie chart example

x <- c(10, 20, 12) # Create data for the graphlabels <- c("good", "average", "bad")#for saving to a file:

png(file = "feedback.jpg") # Give the chart file a namepie(x,labels) # Plot the chartdev.off() # Save the file

## pdf
##   2

pie(x, labels)  # Show in the R Graphics screen


Figure 2.7: A pie-chart in R.

2.5.2 Bar Charts

The function barplot()

Definition 5 .:. barplot() .:.

barplot(height, width = 1, xlab = NULL, ylab = NULL,
        main = NULL, names.arg = NULL, col = NULL, ...)

Some parameters:

• height: the vector or matrix containing the numeric values used in the chart

• xlab: the label for the x-axis

• ylab: the label for the y-axis

• main: is the title of the chart

• names.arg: is a vector of names of each bar

• col: is used to give colors to the bars in the graph.


Figure 2.8: A standard bar-chart based on a vector.

An example for barplot()

sales <- c(100,200,150,50,125)regions <- c("France", "Poland", "UK", "Spain", "Belgium")barplot(sales, width=1,

xlab="Regions", ylab="Sales in EUR",main="Sales 2016", names.arg=regions,border="blue", col="brown")

Stacked bar charts

# Create the input vectors.
colors <- c("orange", "green", "brown")
regions <- c("Mar", "Apr", "May", "Jun", "Jul")
product <- c("Licence", "Maintenance", "Consulting")

# Create the matrix of the values.
values <- matrix(c(20, 80, 0, 50, 140, 10, 50, 80, 20, 10, 30,
                   10, 25, 60, 50),
                 nrow = 3, ncol = 5, byrow = FALSE)

# Create the bar chart.
barplot(values, main = "Sales 2016",
        names.arg = regions,
        xlab = "region", ylab = "sales in EUR",
        col = colors)


Figure 2.9: A bar-chart based on a matrix will produce stacked bars.

# Add the legend to the chart.
legend("topright", product, cex = 1.3, fill = colors)

2.5.3 Boxplots

Boxplots


Definition 6 .:. boxplot() .:.

boxplot(formula, data = NULL, notch = FALSE,
        varwidth = FALSE, names, main = NULL, ...)

with the following description of the parameters used:

• formula: a vector or a formula.

• data: the data frame.

• notch: a logical value (set to TRUE to draw a notch)

• varwidth: a logical value (set to TRUE to draw the width of the box proportionate to the sample size)

• names: the group labels that will be printed under each boxplot.

• main: the title to the graph.

A boxplot example

Let’s use the dataset ships (from the library MASS).

library(MASS)
boxplot(incidents ~ type, data = ships, col = "green",
        main = "# of incidents in function of type")

boxplot(incidents/(76 - year) ~ type, data = ships, col = "green",
        main = "# of incidents per year in function of type")


Figure 2.10: Boxplots show information about the central tendency (median) as well as the spread of the data.

Figure 2.11: In this boxplot the number of incidents is compared to the number of years a ship is in service.


2.5.4 Histograms

The function hist()

Definition 7 .:. hist() .:.

hist(x, breaks = "Sturges", freq = NULL,probability = !freq, include.lowest = TRUE, right= TRUE, density = NULL, angle = 45, col = NULL,border = NULL, main = paste("Histogram of" ,deparse(substitute(x))), xlim = range(breaks),ylim = NULL, xlab = deparse(substitute(x)), ylab,axes = TRUE, plot = TRUE, labels = FALSE, nclass =NULL, warn.unused = TRUE, ...) with the most importantparameters:

• x: the vector containing the numeric values to be used in the histogram

• main: the title of the chart

• col: the color of the bars

• border: the border color of each bar

• xlab: the title of the x-axis

• xlim: the range of values on the x-axis

• ylim: the range of values on the y-axis

• breaks: one of

– a vector giving the breakpoints between histogram cells,

– a function to compute the vector of breakpoints,

– a single number giving the number of cells for the histogram,

– a character string naming an algorithm to compute the number of cells,

– a function to compute the number of cells

• freq: TRUE for frequencies, FALSE for probability density

Histogram example


Figure 2.12: A histogram in R is produced by the hist() function.

library(MASS)
incidents <- ships$incidents

# figure 1: with a rug and fixed breaks
hist(incidents,
     col = c("red", "orange", "yellow", "green", "blue", "purple"))
rug(jitter(incidents))  # add the tick-marks

# figure 2: user-defined breaks for the buckets
hist(incidents,
     col = c("red", "orange", "yellow", "green", "blue", "purple"),
     ylim = c(0, 0.3), breaks = c(0, 2, 5, 10, 20, 40, 80),
     freq = FALSE)

2.5.5 Scatterplots

Scatterplots show many points plotted in the Cartesian plane. Each point represents the combination of two variables. One variable is chosen on the horizontal axis and another on the vertical axis.

Making scatterplots


Figure 2.13: In this histogram the breaks are changed and the y-axis is now calibrated as a probability. Note that leaving freq=TRUE would give the wrong impression that there are more observations in the wider brackets.

Definition 8 .:. scatterplot .:.

plot(x, y, main, xlab, ylab, xlim, ylim, axes, ...) with

• x: the data set for the horizontal axis

• y: the data set for the vertical axis

• main: the title of the graph

• xlab: the title of the x-axis

• ylab: the title of the y-axis

• xlim: the range of values on the x-axis

• ylim: the range of values on the y-axis

• pch: the display symbol

• axes: indicates whether both axes should be drawn on the plot.


Figure 2.14: Some plot characters. Most other characters will just plot themselves.

Plot characters

With the argument pch (short for “plot character”) it is possible to change the symbol that is displayed on the scatterplot. Integer values between 0 and 25 specify a symbol as shown in Figure 2.14. It is possible to change the color via the argument col. pch values from 21 to 25 are filled symbols that allow you to specify a second color bg for the background fill. Most other characters supplied to pch will simply plot themselves.

Scatterplot example

# prepare the data
library(MASS)
mpg2l <- function(mpg = 0) {
  100 * 3.785411784 / 1.609344 / mpg  # miles per gallon to litres per 100 km
}
mtcars$l <- mpg2l(mtcars$mpg)
plot(x = mtcars$hp, y = mtcars$l, xlab = "Horse Power",
     ylab = "L per 100km", main = "Horse Power vs Mileage",
     pch = 22, col = "red", bg = "yellow")


Figure 2.15: A scatter-plot needs an x and a y variable.

2.5.6 Line Graphs

It might be sufficient to add lines to a scatterplot (with the lines() function), but other methods are available.

Making line plots


Definition 9 .:. line plots .:.

plot(x, type, main, xlab, ylab, xlim, ylim, axes, sub, asp, ...) with

• x: the data set for the horizontal axis

• y: the data set for the vertical axis (optional)

• type: indicates the type of plot to be made:

– ”p” for points,

– ”l” for lines,

– ”b” for both,

– ”c” for the lines part alone of ”b”,

– ”o” for both overplotted,

– ”h” for histogram-like (or high-density) vertical lines,

– ”s” for stair steps,

– ”S” for other steps, see Details in the documentation,

– ”n” for no plotting.

• main: the title of the graph

• xlab: the title of the x-axis

• ylab: the title of the y-axis

• xlim: the range of values on the x-axis

• ylim: the range of values on the y-axis

• axes: indicates whether both axes should be drawn on the plot.

• sub: the sub-title

• asp: the y/x aspect ratio

A line-plot example

# prepare the data
years <- c(2000, 2001, 2002, 2003, 2004, 2005)
sales <- c(2000, 2101, 3002, 2803, 3500, 3450)
plot(x = years, y = sales, type = 'b',
     xlab = "Years", ylab = "Sales in USD",
     main = "The evolution of our sales")

Figure 2.16: A line plot of the type b.

main = "The evolution of our sales")points(2004,3500,col="red",pch=16) # highlight one pointtext(2004,3400,"top sales") # annotate the highlight

2.5.7 Plotting Functions

While the function plot() also allows one to draw functions, there is a specific function curve():

fn1 <- function(x) sqrt(1-(abs(x)-1)ˆ2)fn2 <- function(x) -3*sqrt(1-sqrt(abs(x))/sqrt(2))curve(fn1,-2,2,ylim=c(-3,1),col="red",lwd=4,

ylab = expression(sqrt(1-(abs(x)-1)ˆ2) +++ fn_2))curve(fn2,-2,2,add=TRUE,lw=4,col="red")text(0,-1,expression(sqrt(1-(abs(x)-1)ˆ2)))text(0,-1.25,"++++")text(0,-1.5,expression(-3*sqrt(1-sqrt(abs(x))/sqrt(2))))

This also shows the standard capacity of R to include mathematical formulae in its plots, and even to format annotations in a LaTeX-like markup language.


Figure 2.17: Two line plots plotted by the function curve().

2.6 Distributions

Distribution functions in R

The names of the functions related to statistical distributions in R are composed of two sections: the first letter refers to the function (see below) and the remainder is the distribution name.

• d: the pdf (probability density function)

• p: the cdf (cumulative distribution function)

• q: the quantile function

• r: the random number generator


distribution    R-name     distribution         R-name
normal          norm       weibull              weibull
exponential     exp        binomial             binom
log-normal      lnorm      negative binomial    nbinom
logistic        logis      χ2                   chisq
geometric       geom       uniform              unif
poisson         pois       gamma                gamma
t               t          cauchy               cauchy
f               f          hypergeometric       hyper
beta            beta

Table 2.1: Common distributions and their names in R.
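A short illustration of the prefix-plus-name convention (using the standard normal and the uniform distribution):

dnorm(0)      # density of the standard normal at 0: 0.3989423
pnorm(1.96)   # P(X <= 1.96): 0.9750021
qnorm(0.975)  # the 97.5% quantile: 1.959964
rnorm(3)      # three random draws from N(0,1)
punif(0.25)   # the same scheme for the uniform: P(X <= 0.25) = 0.25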

2.6.1 Normal Distribution

In a random collection of data from independent sources, the data is usually distributed normally. This distribution is also called the Gaussian distribution.

When plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve. The center of the curve is the mean of the data set. In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right.

The Normal Distribution in R

R has four built-in functions related to the normal distribution. They are described below.

• dnorm(x, mean, sd): the height of the probability distribution

• pnorm(x, mean, sd): the cumulative distribution function (the probability of an observation to be lower than x)

• qnorm(p, mean, sd): gives the number whose cumulative value matches the given probability value p

• rnorm(n, mean, sd): generates normally distributed variables

with

• x: a vector of numbers

• p: a vector of probabilities

• n: the number of observations (sample size)


Figure 2.18: A comparison between a set of random numbers drawn from the normal distribution (yellow) and the theoretical shape of the normal distribution in blue.

• mean: the mean value of the sample data (default is zero)

• sd: the standard deviation (default is 1).

Illustrating the normal distribution

obs <- rnorm(600,10,3)hist(obs,col="yellow",freq=FALSE)x <- seq(from=0,to=20,by=0.001)lines(x, dnorm(x,10,3),col="blue",lwd=4)

Is the SP500 normally distributed?

library(MASS)
hist(SP500, col = "yellow", freq = FALSE, border = "red")
x <- seq(from = -5, to = 5, by = 0.001)
lines(x, dnorm(x, mean(SP500), sd(SP500)), col = "blue", lwd = 2)

QQ-plots


Figure 2.19: The same plot for the returns of the SP500 index seems acceptable, though there are outliers (where the normal distribution converges fast to zero).

library(MASS)
qqnorm(SP500, col = "red"); qqline(SP500, col = "blue")

2.6.2 Binomial Distribution

The binomial distribution deals with the probability of an event that has only two possible outcomes.

For example, the probability of finding exactly 6 heads when tossing a coin 10 times can be computed with the binomial distribution, as shown below.
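This example can be computed directly with dbinom():

# probability of exactly 6 heads in 10 tosses of a fair coin
dbinom(6, size = 10, prob = 0.5)

## [1] 0.2050781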

The Binomial Distribution in R

R has four built-in functions for the binomial distribution. They are described below.

• dbinom(x, size, prob): the density function

• pbinom(x, size, prob): the cumulative probability of an event

• qbinom(p, size, prob): gives the number whose cumulative value matches a given probability value


Figure 2.20: A qq-plot is a better way to judge if a set of observations is normally distributed or not.

• rbinom(n, size, prob): generates random variables following the binomial distribution

The following parameters are used:

• x: a vector of numbers

• p: is a vector of probabilities

• n: the number of observations

• size: the number of trials

• prob: the probability of success of each trial

An example

# Probability of getting 5 or less heads from 10 tosses of
# a coin.
pbinom(5, 10, 0.5)

## [1] 0.6230469

# visualize this for one to 10 numbers of tosses
x <- 1:10


Figure 2.21: The probability to get maximum x tails with the binomial distribution.

y <- pbinom(x,10,0.5)plot(x,y,type="b",col="blue", lwd=3,

xlab="Number of tails",ylab="prob of maxium x tails",main="10 tosses of a coin")

# How many heads should we at least expect (with a probability
# of 0.25) when a coin is tossed 10 times.
qbinom(0.25, 10, 1/2)

## [1] 4

Generate random variables

# Find 20 random numbers of tails from an event of 10 tosses
# of a coin
rbinom(20, 10, .5)

##  [1]  3  6  3  3  6  3  4  5  6  4  6  4  4  3  8  7  7  3
## [19]  7 10


2.7 Time Series Analysis

2.7.1 Time Series in R

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year. R uses many functions to create, manipulate and plot time series data. The data of a time series is stored in an R object called a time-series object. It is also an R data object, like a vector or data frame.

Time series

The time series object is created by using the ts() function.

ts(data = NA, start = 1, end = numeric(), frequency = 1,
   deltat = 1, ts.eps = getOption("ts.eps"), class = ,
   names = )

with

• data: a vector or matrix containing the values used in the time series.

• start: the start time for the first observation in time series.

• end: the end time for the last observation in time series.

• frequency: the number of observations per unit of time.

– frequency = 12 pegs the data points for every month of a year

– frequency = 4 pegs the data points for every quarter of a year

– frequency = 6 pegs the data points for every 10 minutes of an hour

– frequency = 24*6 pegs the data points for every 10 minutes of a day

Except for the parameter data, all other parameters are optional. To check if an object is a time series we can use the function is.ts(), and as.ts(x) will coerce the variable x into a time-series object.
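For example, a small monthly series (the numbers are invented for illustration):

x <- ts(c(5, 7, 6, 8, 9, 10), start = c(2015, 1), frequency = 12)
x          # printed with month labels Jan 2015 ... Jun 2015
is.ts(x)   # TRUE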

Time-series example


Figure 2.22: The standard plot for a time-series object for the returns of the SP500 index in the 1990s.

# Start from the SP500 data (from the MASS library)
# It is just a vector
head(SP500)

## [1] -0.2588908 -0.8650307 -0.9804139  0.4504321 -1.1856666
## [6] -0.6629097

# Convert it to a time series object.
SP500.ts <- ts(SP500, start = c(1990, 1), frequency = 260)

plot(SP500.ts)

Multiple time-series in one object

val <- c(339.97)
for (k in 2:length(SP500)) {
  val[k] <- val[k-1] * (SP500[k-1] / 100 + 1)
}

# Convert both series to a matrix
M <- matrix(c(SP500, val), nrow = length(SP500))

# Convert the matrix to a time-series object


Figure 2.23: The function plot.ts() will keep the x-axis for both variables the same.

SP <- ts(M, start=c(1990,1),frequency=260)colnames(SP) <- c("Daily Return in Pct","Value")

plot.ts(SP,type="l", main="SP500 in the 1990s")

2.7.2 Forecasting

Forecasting is the process of making predictions about the future, and it is one of the main reasons to do statistics. The idea is that the past holds valuable clues about the future, and that by studying the past one can make reasonable statements about the future.

Every company will need to make forecasts in order to plan, every investor needs to forecast market and company performance in order to make decisions, etc. For example, if the profit grew in each of the last 4 years between 15 and 20%, it is reasonable to forecast a hefty growth rate based on this endogenous data. Of course, this makes abstraction of exogenous data, such as an economic depression in the making.


No forecast is accurate, and the only thing that we know for sure is that the future will hold surprises. It is, however, possible to attach some degree of certainty to our forecast. This is referred to as “confidence intervals”, and this is more valuable than a precise prediction.

There are a number of quantitative techniques that can be utilized to generate forecasts. While some are straightforward, others might for example incorporate exogenous factors and become necessarily more complex.

Our brain is in essence an efficient pattern-recognition machine, so efficient that it will even see patterns where there are none.¹

Typically one relies on the following key concepts:

• Trend: a trend is a long-term increase or decrease (despite short-term fluctuations). This is a specific form of auto-correlation (see below).

• Seasonality: a seasonal pattern is a pattern that repeats itself (e.g. sales of ice-cream in the summer, temperatures in winter compared to summer, etc.)

• Autocorrelation: this refers to the phenomenon whereby values at time t are influenced by previous values. (R provides an autocorrelation plot that helps to find the proper lag structure and the nature of the autocorrelated values; see the short example after this list.)

• Stationarity: a time series is said to be stationary if there is no systematic trend, no systematic change in variance, and if strictly periodic variations or seasonality do not exist.

• Randomness: a time series is random if it shows no specific pattern (e.g. a random walk or Brownian motion). Note, though, that there can still be a systematic shift (a long-term trend).
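The autocorrelation plot mentioned above is produced by acf(); a minimal sketch on the SP500 returns from MASS:

library(MASS)
acf(SP500, main = "Autocorrelation of the SP500 daily returns")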

Moving Average

In the absence of a clear trend, or as a starting point, the moving average is a versatile tool.

require(forecast)

## Loading required package: forecast

g <- read.csv('data/gdp/gdp_pol_sel.csv') # get some dataattach(g) # the names of the data are now always availableplot(year,GDP.per.capitia.in.current.USD,type='b',lwd=3)

¹ This refers to the bias “hot hand fallacy” – see e.g. De Brouwer (2012)


Figure 2.24: A first plot to show the data before we start. This will allow us to select a suitable method for forecasting.

# show the result:
plot(g.movav, col = "blue", lwd = 4,
     main = "Forecast of GDP per capita of Poland",
     ylab = "income in current USD")

lines(year, GDP.per.capitia.in.current.USD, col = "red", type = 'b')

Testing the accuracy of forecasts – backtesting

# testing accuracy of the model by sampling
g.ts.tst <- ts(g.data[1:20], start = c(1990))
g.movav.tst <- forecast(ma(g.ts.tst, order = 3), h = 5)

## Warning in ets(object, lambda = lambda, allow.multiplicative.trend =
## allow.multiplicative.trend, : Missing values encountered. Using longest
## contiguous portion of time series

accuracy(g.movav.tst, g.data[22:26])

##                       ME      RMSE       MAE        MPE
## Training set     28.8983  342.4069  220.7202  0.6165671
## Test set      -1206.5907 1925.5699 1527.0191 -9.2928769
##                   MAPE      MASE        ACF1
## Training set  3.364013 0.3707313 -0.05974669
## Test set     11.599210 2.5648476          NA


Figure 2.25: A forecast based on moving average.

plot(g.movav.tst,col="blue",lw=4,main="Forecast of GDP per capita of Poland",ylab="income in current USD")

lines(year, GDP.per.capitia.in.current.USD, col = "red", type = 'b')

Testing the accuracy of forecasts – backtesting

In the forecast package there is an automatic forecasting function that will run through possible models and select the most appropriate model given the data. This could be an autoregressive model of the first order (AR(1)), an ARIMA model with the right values for p, d, and q, or something else that is more appropriate.

train <- ts(g.data[1:20], start = c(1990))
test  <- ts(g.data[21:26], start = c(2010))
arma_fit <- auto.arima(train)
arma_forecast <- forecast(arma_fit, h = 6)
arma_fit_accuracy <- accuracy(arma_forecast, test)
arma_fit; arma_forecast; arma_fit_accuracy

## Series: train
## ARIMA(0,1,0) with drift
##
## Coefficients:


Figure 2.26: A backtest for our forecast.

##          drift
##       515.5991
## s.e.  231.4786
##
## sigma^2 estimated as 1074618: log likelihood=-158.38
## AIC=320.75  AICc=321.5  BIC=322.64
##      Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2010       12043.19 10714.69 13371.70 10011.419 14074.97
## 2011       12558.79 10680.00 14437.58  9685.431 15432.15
## 2012       13074.39 10773.35 15375.43  9555.257 16593.52
## 2013       13589.99 10932.98 16247.00  9526.444 17653.54
## 2014       14105.59 11134.96 17076.22  9562.406 18648.77
## 2015       14621.19 11367.03 17875.35  9644.381 19598.00
##                        ME      RMSE      MAE        MPE
## Training set   0.06078049  983.4411 602.2902 -2.3903997
## Test set      54.36338215 1036.0989 741.8024  0.1947719
##                  MAPE      MASE        ACF1 Theil's U
## Training set 8.585820 0.7612222 -0.03066223        NA
## Test set     5.668504 0.9375488  0.04756832 0.9928242

plot(arma_forecast, col="blue",lw=4,main="Forecast of GDP per capita of Poland",ylab="income in current USD")

lines(year,GDP.per.capitia.in.current.USD,col="red",type='b')


Figure 2.27: Optimal moving average forecast.

Note that ARIMA stands for autoregressive integrated moving average model. It is a generalization of the autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where the data shows evidence of non-stationarity; there an initial differencing step (corresponding to the ”integrated” part of the model) can be applied one or more times to eliminate the non-stationarity.
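For comparison with the automatic selection above, a model can also be specified by hand with Arima() from the forecast package (a sketch; the order (1,1,0) is just an example, not the model chosen by auto.arima above):

fit_manual <- Arima(train, order = c(1, 1, 0), include.drift = TRUE)
forecast(fit_manual, h = 3)   # three steps ahead with this hand-picked model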

Simple Exponential smoothing

Exponential smoothing assigns higher weights to the most recent observations (the weights decrease exponentially). The effect is that a new dramatic event has a much faster impact, and that the “memory” of it decreases exponentially.

g.exp <- ses(g.data,5,initial="simple")g.exp # simple exponential smoothing uses the last value as

##      Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 2016       12558.87 11144.352 13973.39 10395.550 14722.19
## 2017       12558.87 10558.438 14559.30  9499.473 15618.27
## 2018       12558.87 10108.851 15008.89  8811.889 16305.85
## 2019       12558.87  9729.832 15387.91  8232.229 16885.51
## 2020       12558.87  9395.909 15721.83  7721.538 17396.20


Figure 2.28: Forecasting with an exponentially smoothed moving average.

# the forecast and finds confidence intervals around it

plot(g.exp,col="blue",lw=4,main="Forecast of GDP per capita of Poland",ylab="income in current USD")

lines(year, GDP.per.capitia.in.current.USD, col = "red", type = 'b')

Holt Exponential smoothing

g.exp <- holt(g.data,5,initial="simple")g.exp # Holt exponential smoothing

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2016       13445.71 12144.40 14747.02 11455.53 15435.89
## 2017       13950.04 12257.07 15643.01 11360.87 16539.22
## 2018       14454.37 12444.67 16464.07 11380.80 17527.94
## 2019       14958.70 12675.80 17241.61 11467.31 18450.10
## 2020       15463.04 12936.30 17989.77 11598.73 19327.34

plot(g.exp,col="blue",lw=4,main="Forecast of GDP per capita of Poland",ylab="income in current USD")

lines(year, GDP.per.capitia.in.current.USD, col = "red", type = 'b')


Figure 2.29: Holt exponentially smoothed moving average.

Seasonal Decomposition

The Seasonal Trend Decomposition using Loess (STL) is an algorithm that was developed to help divide a time series into three components, namely: the trend, seasonality and remainder. The methodology was presented by Robert Cleveland, William Cleveland, Jean McRae and Irma Terpenning in the Journal of Official Statistics in 1990. The STL is available within R via the stl() function.

Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation: new.ts <- log(old.ts).

# we use the data nottem
# Average Monthly Temperatures at Nottingham, 1920-1939
nottem.stl <- stl(nottem, s.window = "periodic")
plot(nottem.stl)

The four graphs are the original data, the seasonal component, the trend component and the remainder; this shows the periodic seasonal pattern extracted from the original data and the trend that moves around between 47 and 51 degrees Fahrenheit. There is a bar at the right-hand side of each graph to allow a relative comparison of the magnitudes of each component. For this data the change in trend is smaller than the variation due to the monthly seasonality.
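The individual components are stored in the time.series element of the stl object, so they can be inspected or reused directly:

head(nottem.stl$time.series)  # columns: seasonal, trend, remainder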


Figure 2.30: Using the stl-function to decompose data in a seasonal part and a trend.

Exponential Models

Both the HoltWinters() function in the base installation and the ets() function in the forecast package can be used to fit exponential models.

# simple exponential: models level
fit <- HoltWinters(g.data, beta = FALSE, gamma = FALSE)

# double exponential: models level and trend
fit <- HoltWinters(g.data, gamma = FALSE)

# triple exponential: models level, trend, and seasonal
# components
#fit <- HoltWinters(g.data)  # fails on our data-set

# predictive accuracy
library(forecast)
accuracy(forecast(fit, 5))

##                     ME     RMSE      MAE       MPE
## Training set -69.84485 1051.488 711.7743 -2.775476
##                  MAPE      MASE        ACF1
## Training set 9.016881 0.8422587 0.008888197


Figure 2.31: The Holt-Winters model fits an exponential trend. Here we plot the double exponential model.

# predict next 5 future values
forecast(fit, 5)

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2016       13457.63 12084.15 14831.11 11357.07 15558.18
## 2017       13961.96 12179.69 15744.23 11236.21 16687.71
## 2018       14466.29 12352.87 16579.71 11234.10 17698.49
## 2019       14970.62 12571.33 17369.91 11301.23 18640.02
## 2020       15474.95 12820.41 18129.50 11415.17 19534.74

plot(forecast(fit, 5),col="blue",lw=4,main="Forecast of GDP per capita of Poland",ylab="income in current USD")

lines(year, GDP.per.capitia.in.current.USD, col = "red", type = 'b')

# Use the Holt-Winters method for the temperatures
n.hw <- HoltWinters(nottem)
n.hw.fc <- forecast(n.hw, 50)
plot(n.hw.fc)

Exercise


Figure 2.32: The Holt-Winters model applied to the temperatures in Nottingham.

Question

Use the moving average method on the temperatures in Nottingham (nottem). Does it work? Which model would work better?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


PART I .:. CHAPTER 3 .:.

Elements of Descriptive Statistics

3.1 Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different conditions some measures of central tendency become more appropriate to use than others. In the following sections we will look at the mean, mode and median, learn how to calculate them, and see under what conditions each is most appropriate to be used.

3.1.1 Mean

Probably the most used measure of central tendency is the “mean”. In this section we will start from the arithmetic mean, but illustrate some other concepts that might be more suited in some situations too.

Arithmetic Mean


Definition 10 .:. Arithmetic Mean .:.

\bar{x} = \sum_{n=1}^{N} P(x_n)\, x_n \quad \text{(for discrete distributions)}

\bar{x} = \int_{-\infty}^{+\infty} x\, f(x)\, dx \quad \text{(for continuous distributions)}

with f(x) the probability density function. The unbiased estimator of the mean for K observations x_k is:

\hat{x} = \frac{1}{K} \sum_{k=1}^{K} x_k

the mean in R

# The mean of a vector
x <- c(1, 2, 3, 4, 5, 60)
mean(x)

## [1] 12.5

mean(ships$service)

## [1] 4089.35

# Ignoring missing values
x <- c(1, 2, 3, 4, 5, 60, NA)
mean(x)

## [1] NA

mean(x,na.rm = TRUE)

## [1] 12.5

# This works also for a matrix
M <- matrix(c(1, 2, 3, 4, 5, 60), nrow=3)
mean(M)

## [1] 12.5

Generalized means


Definition 11 .:. f-mean .:.

\bar{x} = f^{-1}\!\left(\frac{1}{K} \sum_{k=1}^{K} f(x_k)\right)

Popular choices for f are:

• f(x) = x : arithmetic mean,

• f(x) = 1/x : harmonic mean,

• f(x) = x^m : power mean,

• f(x) = ln x : geometric mean, so \bar{x} = \left(\prod_{k=1}^{K} x_k\right)^{1/K}
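To make this concrete, here is a minimal sketch of the f-mean in R; the helper f_mean() and its f_inv argument are our own illustration, not a base R function:

# generalized f-mean: apply f, average, then invert f
f_mean <- function(x, f, f_inv) f_inv(mean(f(x)))

x <- c(1, 2, 3, 4, 5, 60)
f_mean(x, identity, identity)                 # arithmetic mean: 12.5
f_mean(x, log, exp)                           # geometric mean
f_mean(x, function(v) 1/v, function(v) 1/v)   # harmonic mean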

The Power Mean
One particular generalized mean is the power mean or Hölder mean. It is defined for a set of K positive numbers x_k by

\bar{x}(m) = \left(\frac{1}{K} \sum_{k=1}^{K} x_k^m\right)^{1/m}

By choosing particular values for m one can get the quadratic, arithmetic, geometric and harmonic means:

• m → ∞: maximum of x_k

• m = 2: quadratic mean

• m = 1: arithmetic mean

• m → 0: geometric mean

• m = −1: harmonic mean

• m → −∞: minimum of x_k
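A short sketch of the power mean in R (pow_mean() is our own name, not a base R function):

pow_mean <- function(x, m) mean(x^m)^(1/m)

x <- c(1, 2, 3, 4, 5)
pow_mean(x,  2)     # quadratic mean
pow_mean(x,  1)     # arithmetic mean: 3
pow_mean(x, -1)     # harmonic mean
pow_mean(x, 1e-6)   # m near 0 approaches the geometric mean
exp(mean(log(x)))   # the geometric mean, for comparison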

Which mean makes most sense?
What is the average return when you know that the share price had the following returns: −50%, +50%, −50%, +50%? Try the arithmetic mean and the mean of the log-returns.


returns <- c(0.5,-0.5,0.5,-0.5)

# arithmetic mean
aritmean <- mean(returns)

# the ln-mean
log_returns <- returns
for(k in 1:length(returns)) {
  log_returns[k] <- log(returns[k] + 1)
}
logmean <- mean(log_returns)
exp(logmean) - 1

## [1] -0.1339746

# What is the value of the investment after these returns?
V_0 <- 1
V_T <- V_0
for(k in 1:length(returns)) {
  V_T <- V_T * (returns[k] + 1)
}
V_T

## [1] 0.5625

# Compare this to our predictions
## mean of log-returns
V_0 * (exp(logmean) - 1)

## [1] -0.1339746

## mean of returns
V_0 * (aritmean + 1)

## [1] 1

3.1.2 Median

The median
The median is the middle value, so that 50% of the observations are lower and 50% are larger.

# The median of an R-object
x <- c(1, 2, 3, 4, 5, 60, NA)
median(x)


## [1] NA

median(x,na.rm = TRUE)

## [1] 3.5

3.1.3 Arithmetic Mode

The mode
The mode is the value that has the highest probability to occur. For a series of observations, this should be the one that occurs most often. Note that the mode is also defined for variables that have no order-relation: even labels such as “green”, “yellow”, etc. have a mode, but (without further abstraction or a numerical representation) not a mean or median.

In R, the function mode() or storage.mode() returns a character string describing how a variable is stored. In fact, R does not have a standard function to calculate the mode, so let's create our own:

my_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# now test this function
x <- c(1, 2, 3, 3, 4, 5, 60, NA)
my_mode(x)

## [1] 3

x1 <- c("relevant", "N/A", "undesired", "great", "N/A","undesired", "great", "great")

my_mode(x1)

## [1] "great"

# text from https://www.r-project.org/about.html
t <- "R is available as Free Software under the terms of the
Free Software Foundations GNU General Public License in
source code form. It compiles and runs on a wide variety of
UNIX platforms and similar systems (including FreeBSD and
Linux), Windows and MacOS."
v <- unlist(strsplit(t, split=" "))
my_mode(v)

## [1] "and"


3.2 Measures of Variation or Spread

Definition 12 .:. variance .:.

VAR(X) = E\left[(X - \bar{X})^2\right]

Standard Deviation
Definition 13 .:. standard deviation .:.

SD(X) := \sqrt{\frac{1}{N-1} \sum_{n=1}^{N} (X_n - \bar{X})^2}

t <- test.data$score
var(t)

## [1] 29.2

sd(t)

## [1] 5.403702

sqrt(var(t))

## [1] 5.403702

sqrt(sum((t - mean(t))^2) / (length(t) - 1))

## [1] 5.403702

Median Absolute Deviation
Definition 14 .:. MAD .:.

MAD(X) = 1.4826 \cdot \text{median}\left(|X - \text{median}(X)|\right)

mad(t)

## [1] 5.9304


mad(t,constant=1)

## [1] 4

The default constant = 1.4826 (approximately 1/\Phi^{-1}(3/4) = 1/qnorm(3/4)) ensures consistency, i.e.

E[\text{mad}(X_1, \ldots, X_n)] = \sigma

for X_i distributed as N(\mu, \sigma^2) and large n.
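This consistency is easy to check by simulation; a minimal sketch (the seed and the sample size are arbitrary choices of ours):

set.seed(1908)
x <- rnorm(1e6)        # standard normal, so sigma = 1
mad(x)                 # close to 1 thanks to the default constant
mad(x, constant = 1)   # close to qnorm(3/4), about 0.6745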


3.3 Measures of Covariation

The basic measure for linear interdependence is covariance, defined as

covar(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y]

An important metric for a linear relationship is the Pearson correlation coefficient ρ.

Pearson correlation

Definition 15 .:. Pearson Correlation Coefficient .:.

\rho_{XY} = \frac{\text{covar}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{E[(X - E[X])^2]\, E[(Y - E[Y])^2]}}

cor(mtcars$hp,mtcars$wt)

## [1] 0.6587479

The Spearman correlation
The Spearman correlation is the correlation applied to the ranks of the data. It is one if an increase in the variable X is always accompanied by an increase in the variable Y.

cor(rank(mtcars$hp),rank(mtcars$wt))

## [1] 0.7746767

The Spearman correlation checks for a relationship that can be more general than only linear: it will be one whenever Y increases monotonically with X.
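Instead of ranking by hand, the same coefficient is available directly via the method argument of cor():

# equivalent to cor(rank(mtcars$hp), rank(mtcars$wt))
cor(mtcars$hp, mtcars$wt, method = "spearman")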

Exercise: correlation


Question

Consider the vectors

1. x = c(1, 2, 33, 44) and y = c(22, 23, 100, 200),

2. x = c(1:10) and y = 2 * x,

3. x = c(1:10) and y = exp(x).

Plot y as a function of x. What is their Pearson correlation? What is their Spearman correlation? How do you explain the difference?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


3.4 Chi-Square Tests

The Chi-square test is a statistical method to determine if two categorical variables have a significant correlation between them. Both variables should be from the same population and they should be categorical, like Yes/No, Male/Female, Red/Green, etc.

For example, we can build a data set with observations on people's ice-cream buying pattern and try to correlate the gender of a person with the flavour of ice-cream they prefer. If a correlation is found, we can plan for an appropriate stock of flavours by knowing the gender distribution of the visiting customers.

Chi-Square test in R

Definition 16

chisq.test(data)
where data is the data in the form of a table containing the count values of the variables.

An example for chisq.test()

# we use the built-in data set mtcars
df <- data.frame(mtcars$cyl, mtcars$am)
chisq.test(df)

## Warning in chisq.test(df): Chi-squared approximation maybe incorrect

##
## Pearson's Chi-squared test
##
## data: df
## X-squared = 25.077, df = 31, p-value = 0.7643

Conclusion: the p-value is higher than 0.05, so there is no significant correlation.
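Note that Definition 16 asks for a table of counts; a variant that first builds the contingency table with table() is sketched below (output omitted; the p-value is interpreted in the same way):

# build the contingency table of counts, then test independence
tbl <- table(mtcars$cyl, mtcars$am)
chisq.test(tbl)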


PART I .:. CHAPTER 4 .:.

Elements of Inferential Statistics

4.1 Regression Models

4.1.1 Linear Regression

Linear Regression
With a linear regression we try to estimate an unknown variable y based on a known variable x and some constants (a and b). Its form is

y = ax + b

library(MASS)

# Explore the data
plot(survey$Height, survey$Wr.Hnd)

# Create the model
lm1 <- lm(formula = Wr.Hnd ~ Height, data = survey)
summary(lm1)

##
## Call:
## lm(formula = Wr.Hnd ~ Height, data = survey)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.6698 -0.7914 -0.0051  0.9147  4.8020



Figure 4.1: A scatter-plot generated by the line “plot(survey$Height, survey$Wr.Hnd)”.

##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.23013    1.85412  -0.663    0.508
## Height       0.11589    0.01074  10.792   <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 206 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared: 0.3612, Adjusted R-squared: 0.3581
## F-statistic: 116.5 on 1 and 206 DF, p-value: < 2.2e-16

# predictions
h <- data.frame(Height = 150:200)
Wr.lm <- predict(lm1, h)
plot(survey$Height, survey$Wr.Hnd, col="red")
lines(t(h), Wr.lm, col="blue", lwd=3)

In the previous code we visualized the model by adding the predictions of the linear model (connected with lines between them, so the result is just a line). The function abline() provides another elegant way to draw



Figure 4.2: A plot visualizing the linear regression model (the data in red and the regression in blue).

straight lines in plots. The function takes as arguments the intercept and the slope.

# Or use the function abline()
plot(survey$Height, survey$Wr.Hnd, col = "red",
     main = "Hand span in function of Height",
     abline(lm(survey$Wr.Hnd ~ survey$Height),
            col='blue', lwd=3),
     cex = 1.3, pch = 16, xlab = "Height", ylab = "Hand Span")

Exercise: linear regression

Question

Consider the built-in data set mtcars. Make a linear regression of the fuel consumption as a function of the parameter that, according to you, has the most explanatory power. Study the residuals. What is your conclusion?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .



Figure 4.3: Using the function abline() and cleaning up the titles.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.1.2 Multiple Linear Regression

Multiple Linear Regression
Multiple regression models the relationship between several known variables and one variable to be predicted.

y = b + a_1 x_1 + a_2 x_2 + \ldots + a_n x_n

In R the lm() function will handle this too.

# We use the built-in data set mtcars
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
print(model)

##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = mtcars)
##
## Coefficients:


## (Intercept)        disp          hp          wt
##   37.105505   -0.000937   -0.031157   -3.800891

# Accessing the coefficients
a_disp <- coef(model)[2]
a_hp   <- coef(model)[3]
a_wt   <- coef(model)[4]

print(a_disp)

## disp## -0.0009370091

print(a_hp)

## hp## -0.03115655

print(a_wt)

## wt## -3.800891

# This allows us to manually predict the fuel consumption,
# e.g. for the Mazda RX4:
coef(model)[1] + a_disp * 160 + a_hp * 110 + a_wt * 2.62

## (Intercept)
##    23.57003
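The same number comes out of predict(), without any manual arithmetic:

predict(model, newdata = data.frame(disp = 160, hp = 110, wt = 2.62))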

Exercise: multiple linear regression

Question

Consider the built-in data set mtcars. Make a linear regression that predicts the fuel consumption of a car. Make sure to include only significant variables, and remember that the significance of a variable depends on the other variables in the model.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

— 105 — Philippe De Brouwer

Page 106: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

CHAPTER 4. ELEMENTS OF INFERENTIAL STATISTICS

4.1.3 Logistic Regression
Logistic regression (aka logit regression) is a regression model where the unknown variable is categorical (it can have only a limited number of values): it can either be “0” or “1”. In reality it can refer to any mutually exclusive concept such as: repay/default, pass/fail, win/lose, survive/die or healthy/sick.

Cases where the dependent variable has more than two outcome categories may be analysed in a multinomial logistic regression or, if the multiple categories are ordered, in an ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model.

Generalized form of the Logistic Regression

\ln\left\{\frac{P[Y = 1|X]}{P[Y = 0|X]}\right\} = \alpha + \sum_{n=1}^{N} f_n(X_n)

with X = (X_1, X_2, \ldots, X_N) the set of prognostic factors. This type of model can be used to predict the probability that Y = 1, or to study the f_n(X_n) and hence understand the dynamics of the problem. The general additive logistic regression can be solved by estimating the f_n() via a back-fitting algorithm within a Newton-Raphson procedure. Below we will focus on the linear additive logistic regression.

Logistic Regression
Assuming a linear model, the probability that Y = 1 is hence modelled as:

y = \frac{1}{1 + e^{-(b + a_1 x_1 + a_2 x_2 + a_3 x_3 + \ldots)}}

# Consider the relation between the hours studied and
# passing an exam
hours <- c(0, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75,
           1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25,
           3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50)
pass  <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
           1, 0, 1, 1, 1, 1, 1, 1)
d <- data.frame(cbind(hours, pass))
m <- glm(formula = pass ~ hours, family = binomial,
         data = d)

Note that the function glm() is also used for the Poisson regression; see Chapter 4.1.4.



Figure 4.4: The grey diamonds with red border are the data points (not passed is 0 and passed is 1) and the blue line represents the logistic regression model (the probability of passing the exam as a function of the hours studied).

# Visualize the results
plot(hours, pass, col="red", pch = 23, bg="grey")
pred <- 1 / (1 + exp(-(coef(m)[1] + hours * coef(m)[2])))
lines(hours, pred, col="blue", lwd=3)
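The manually computed probabilities equal what predict() returns with type = "response"; a quick check (the comparison should print TRUE):

# predicted probabilities straight from the fitted glm object
pred2 <- predict(m, newdata = data.frame(hours = hours),
                 type = "response")
all.equal(as.numeric(pred2), pred)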

4.1.4 Poisson Regression

The Poisson regression assumes that the response variable Y has a Poisson distribution, and that the logarithm of its expected value can be modelled by a linear combination of certain variables. The Poisson regression model is also known as a “log-linear model”.

In particular, the Poisson regression is only useful where the unknown variable (response variable) can never be negative, for example in the case of predicting counts (e.g. numbers of events).

Poisson Regression


Definition 17 .:. Poisson Regression .:.

The general form of the Poisson Regression is

\log(y) = b + a_1 x_1 + a_2 x_2 + \ldots + a_n x_n

with:

• y: the predicted variable (aka response variable or unknown variable)

• a and b are the numeric coefficients.

• x is the known variable (or the predictor variable)

The Poisson Regression in R
The Poisson Regression can also be handled by the function glm() in R:

glm(formula, data, family)

where:

• formula is the symbolic description of the relationship between the variables,

• data is the data-set giving the values of these variables,

• family is the R object that specifies the details of the model; for the Poisson regression its value is "poisson".

Note that the function glm() was also used for the logistic regression; see Chapter 4.1.3.

Example
We will check if we can estimate the number of cylinders of a car based on its horse power and weight, using the dataset mtcars.

m <- glm(cyl ~ hp + wt, data = mtcars, family = "poisson")
summary(m)

##
## Call:
## glm(formula = cyl ~ hp + wt, family = "poisson", data = mtcars)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max


## -0.59240 -0.31647 -0.00394  0.29820  0.68731
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.064836   0.257317   4.138  3.5e-05 ***
## hp          0.002220   0.001264   1.756    0.079 .
## wt          0.124722   0.090127   1.384    0.166
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 16.5743 on 31 degrees of freedom
## Residual deviance:  4.1923 on 29 degrees of freedom
## AIC: 126.85
##
## Number of Fisher Scoring iterations: 4

Weight does not seem to be relevant, so we drop it and try again (only using horse power):

m <- glm(cyl ~ hp, data = mtcars, family = "poisson")
summary(m)

##
## Call:
## glm(formula = cyl ~ hp, family = "poisson", data = mtcars)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -0.97955 -0.30748 -0.03387  0.28155  0.73433
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3225669  0.1739422   7.603 2.88e-14 ***
## hp          0.0032367  0.0009761   3.316 0.000913 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 16.5743 on 31 degrees of freedom
## Residual deviance:  6.0878 on 30 degrees of freedom
## AIC: 126.75
##
## Number of Fisher Scoring iterations: 4
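Because the model is log-linear, a fitted count is obtained by exponentiating the linear predictor; predict() does this for us with type = "response" (a small sketch, output omitted):

# expected number of cylinders for a car with 110 hp,
# i.e. exp(1.3225669 + 0.0032367 * 110)
predict(m, newdata = data.frame(hp = 110), type = "response")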


4.1.5 Non-Linear Regression
In many cases one will observe that the relation between the unknown variable and the known variables is not simply linear. Whenever we plot the data and see that the relation is not a straight line but rather a curve, the relation is non-linear. It is possible to model this by applying a function such as squaring, a sine, a logarithm or an exponential to the known variable(s) and then running a linear regression. However, R has a specific function for this: nls().

In least squares regression, we establish a regression model in which the sum of the squares of the vertical distances of the different points from the regression curve is minimized. We generally start with a defined model and assume some values for the coefficients. We then apply the nls() function of R to get the more accurate values along with the confidence intervals.

Syntax of Non-Linear Regression

Definition 18 .:. nls() .:.

nls(formula, data, start), with:

1. formula a non-linear model formula including variables and parameters,

2. data the data-frame used to optimize the model,

3. start a named list or named numeric vector of starting estimates.

Example for nls()

# consider observations for dt = d0 + v0 t + 1/2 a t^2
t  <- c(1, 2, 3, 4, 5, 1.5, 2.5, 3.5, 4.5, 1)
dt <- c(8.1, 24.9, 52, 89.2, 136.1, 15.0, 37.0, 60.0, 111.0, 8)

# Plot these values.
plot(t, dt, xlab="time", ylab="distance")

# Take the assumed values and fit them into the model.
model <- nls(dt ~ d0 + v0 * t + 1/2 * a * t^2,
             start = list(d0 = 1, v0 = 3, a = 10))

# plot the model curve
simulation.data <- data.frame(t = seq(min(t), max(t), len = 100))



Figure 4.5: The results of the non-linear regression with nls(). This plot indicates that there is one outlier and you might want to re-run the model without this observation.

lines(simulation.data$t,
      predict(model, newdata = simulation.data), col="red")

# Learn about the model
summary(model)  # the summary

##
## Formula: dt ~ d0 + v0 * t + 1/2 * a * t^2
##
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)
## d0    4.981      4.660   1.069    0.321
## v0   -1.925      3.732  -0.516    0.622
## a    11.245      1.269   8.861 4.72e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.056 on 7 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 1.822e-07


print(sum(residuals(model)^2))  # squared sum of residuals

## [1] 65.39269

print(confint(model)) # confidence intervals

## Waiting for profiling to be done...

## 2.5% 97.5%## d0 -6.038315 15.999559## v0 -10.749091 6.899734## a 8.244167 14.245927

Note that R has a shorthand notation built in for the function residuals(): resid() will do the same.


4.2 The Model Performance

The choice of a model is a complicated task. One needs to consider:

• Simple predictive Model Quality (i.e. Height of the lift curve / AUC)

• Generalization ability of the Model (Difference of model quality between Creation set and Test set / Does the model overfit?)

• Explanatory Power of the model (Does the model make sense of the data? Can it explain something?)

• Model Stability (What is the confidence interval of the model on the lift curves?)

• Is the model robust against the erosion of time?

In this section we will focus on the first issue: the intrinsic predictive quality of the model.

4.2.1 Mean Square Error (MSE)

Definition 19 .:. Mean Square Error (MSE) .:.

The mean square error is the residual variance not explained by the model; it is defined by:

MSE(y, \hat{y}) := \frac{1}{N} \sum_{k=1}^{N} (y_k - \hat{y}_k)^2

4.2.2 R-Squared

While the MSE provides a reliable measure of the variation that is not explained by the model, it is sensitive to units (e.g. the use of millimetres or kilometres will result in a different MSE for the same model). To solve this issue we define R² as a normalized version of the MSE.


Definition 20 .:. R-squared .:.

R-squared is the “fraction of the variance explained by the model”. We can define R-squared as:

R^2 := \frac{\sum_{k=1}^{N} (\hat{y}_k - \bar{y})^2}{\sum_{k=1}^{N} (y_k - \bar{y})^2}

Example

m <- lm(data=mtcars, formula=mpg ~ wt)
summary(m)

##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

summary(m)$r.squared

## [1] 0.7528328

Hint: get more information via help(summary.lm).

Exercise: model performance for linear regression

Question

Use the built-in dataset mtcars and try to find the model that best explains the consumption (mpg).


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In fact the most general form of R² is

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

with SS_{res} the sum of squares of the residuals. For OLS regressions it is so that SS_{res} + SS_{reg} = SS_{tot}, and hence we have the formula

R^2 = \frac{SS_{reg}}{SS_{tot}}
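We can verify this identity on the earlier example (a short sketch; the last line should match summary(m)$r.squared):

m <- lm(mpg ~ wt, data = mtcars)
SS_res <- sum(resid(m)^2)
SS_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - SS_res / SS_tot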

4.2.3 Mean Average Deviation (MAD)
In cases where outliers matter less or where outliers distort the results, one can rely on more robust methods that are based on the median. Many variations of these measures can be useful. We present a selection.

Definition 21 .:. Mean Average Deviation (MAD) .:.

MAD(y, \hat{y}) := \frac{1}{N} \sum_{k=1}^{N} |y_k - \hat{y}_k|
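A minimal sketch computing MSE and MAD for the regression above (the variable names are ours):

m <- lm(mpg ~ wt, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(m)
c(MSE = mean((y - y_hat)^2),    # Definition 19
  MAD = mean(abs(y - y_hat)))   # Definition 21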

4.2.4 The performance of binary classification models

AUC for logistic regression

A popular way to calculate the AUC is the pROC library. However, before we start, we will first introduce another example of a logistic regression, based on data from the Titanic disaster.

A logistic model for surviving the Titanic disaster
The data is already divided in a set to train the model (titanic_train) and a set to test the model (titanic_test). The data was supplied by the website https://www.kaggle.com/c/titanic/data.


# if necessary: install.packages("titanic")
library(titanic)
library(pROC)

## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

d <- titanic_train
str(d)  # explore the data-set
m1 <- glm(formula = Survived ~ Sex + Pclass, family = binomial,
          data = d)

Make predictions
There are of course many ways to make predictions for a logistic regression. Here we choose the following function, which is available on github.com.

# copy the function log_reg from:
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/log_reg.R

d1 <- data.frame(cbind(d$Survived, d$Sex, d$Pclass))
names(d1) <- c("survived", "sex", "class")

predictions <- log_reg(d1, size=10)
str(predictions)

## 'data.frame': 891 obs. of 2 variables:
## $ survived: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 2 2 2 ...
## $ pred    : num 0.0958 0.0958 0.9057 0.5889 0.0958 ...

Plot the distributions of the test

# copy: https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_pred_type_distribution.R
library(ggplot2)
plot_pred_type_distribution(predictions, 0.7)

If we consider survival as a positive (1) and death as a negative (0), then the plot illustrates the tradeoff we face upon choosing a threshold. As we remember from Figure 4.4, the probability of a positive outcome slowly increases and there will be a grey zone. So we have to decide where we put the cutoff. If we increase the threshold, the number of false positive



Figure 4.6: A visualization of the predictions of a model for surviving the Titanic disaster. Note that FN = false negative, TN = true negative, FP = false positive, TP = true positive.

(FP) results is lowered; however, the number of false negative (FN) results increases. In order to help us with that decision, we can consider a cost function that sums the cost of each false positive and each false negative.

Receiver Operating Characteristic
The question of how to balance false positives and false negatives (depending on the cost/consequences of either mistake) was first addressed in a rigorous framework during World War II, in the context of interpretation of radar signals for the identification of enemy air planes. In order to visualize and quantify the impact of a threshold on the FP/FN-tradeoff, the ROC curve was used. The ROC curve is the interpolated curve made of points whose coordinates are functions of the threshold: threshold = θ ∈ R, here θ ∈ [0, 1]:

ROC_x(\theta) = FPR(\theta) = \frac{FP(\theta)}{FP(\theta) + TN(\theta)} = \frac{FP(\theta)}{\#N}

ROC_y(\theta) = TPR(\theta) = \frac{TP(\theta)}{FN(\theta) + TP(\theta)} = \frac{TP(\theta)}{\#P} = 1 - \frac{FN(\theta)}{\#P} = 1 - FNR(\theta)

In lending terminology one usually refers to a “negative” as a “bad” (i.e. the customer defaults on his loan) and to a “positive” as a “good” (the



Figure 4.7: The ROC and cost curves for our model on surviving the Titanic disaster.

customer repays the loan). This allows us to express the TPR as the “fraction of goods accepted”, and the FPR as the “fraction of bads accepted”. This terminology allows us to understand what the impact of the error is: as θ decreases one will accept more goods (and hence earn more), but one will also accept more bads (and lose money on those customers). This is illustrated by a cost function in Figure 4.7.

In terms of hypothesis tests, where rejecting the null hypothesis is considered a positive result, the FPR (false positive rate) corresponds to the Type I error, the FNR (false negative rate) to the Type II error and (1 − FNR) to the power. So the ROC for the above distribution of predictions would be:

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/calculate_roc.R
roc <- calculate_roc(predictions, 1, 2, n = 100)

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_roc.R
library(grid)
library(gridExtra)
plot_roc(roc, 0.7, 1, 2)

The dashed lines indicate the location of the (FPR, TPR) pair corresponding to a threshold of 0.7. Note that the lower corner (0,0) is associated with a threshold of 1 and the top corner (1,1) with a threshold of 0.


The cost function and the corresponding colouring of the ROC points illustrate that an optimal FPR and TPR combination is determined by the associated cost. Depending on the use case, false negatives might be more costly than false positives or vice versa. Here I assumed a cost of 1 for FP cases and a cost of 2 for FN cases.

Area Under (ROC) Curve
The optimal point on the ROC curve is (FPR, TPR) = (0, 1): no false positives and all true positives. So the closer we get there the better; this implies that we could use the distance to the point (0, 1) as a measure of the quality of the model.

The second essential observation is that the curve is by definition monotonically increasing:

\theta \leq \theta' \implies TPR(\theta) \geq TPR(\theta')

A good way to understand this inequality is by remembering that the TPR is also the fraction of observed survivors that is predicted by the model to survive (in “lending terms”: the fraction of goods accepted). As the threshold is lowered, this fraction can only increase and never decrease. Another illustration can be found by considering Figure 4.6 and mentally pushing the threshold (red line) down: the number of true positives can only increase as the cutoff θ is decreased.

One also observes that a random prediction would follow the identity line ROC_y = ROC_x, because if the model has no predictive power then the true positive rate will increase at the same pace as the false positive rate.

Therefore a reasonable model's ROC is located above the identity line, as a point below it would imply a prediction performance worse than random. If that would happen, it is of course possible to invert the prediction.

All those features combined make it reasonable to summarize the ROC into a single value by calculating the area of the shape below the ROC curve: this is the AUC. The closer the ROC gets to the optimal point of perfect prediction, the closer the AUC gets to 1.

# AUC for the example
library(pROC)
auc(predictions$survived, predictions$pred)

## Area under the curve: 0.7859

Exercise: model performance for logistic regression


Question

Calculate and discuss the model performance of the linear models made in Chapter 4.1.1 and in Chapter 4.1.2.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2.5 Gini for logistic regression
The Gini coefficient is an alternative measure based on the same plot as used for the AUC. The name refers to Corrado Gini (Gini, 1912). The popular Gini coefficient aims to be a measure of inequality in a given society.

For a logistic regression model, the Gini is the area comprised between the ROC curve and the identity line (the random model), normalized by the total area that is better than the random model (so that a perfect model scores 1). The relation with the AUC is:

Gini = 2×AUC − 1

So where a really bad AUC is close to 0.5, the same really bad Gini will be close to zero.

4.2.6 Kolmogorov-Smirnov (KS) for logistic regression
The KS test is another measure that aims to summarize the power of a model in one parameter. In general, the KS is the largest distance between 2 cumulative distribution functions:

KS = \sup_x |F_1(x) - F_2(x)|

The KS of a logistic regression model is the KS applied to the distributions of the bads and the goods (as a function of their score). The higher the KS, the better the model. Typically 0.2 is considered a good value.


The package ROCR

The package ROCR provides a complete assessment of binary classification models such as logistic regressions. This section can be seen as an alternative approach where we don't have to copy any code and just use the standard package.

We continue the example of the Titanic disaster of the previous section and repeat below how to load the data and fit an initial model.

# if necessary: install.packages("titanic")
#               install.packages("ROCR")
library(titanic)
d <- titanic_train
m1 <- glm(formula = Survived ~ Sex + Pclass, family = binomial,
          data = d)

First we need to make some predictions.

# this will provide us scores between 0 and 1
predicScore <- predict(object=m1, type="response")
head(predicScore)

##          1          2          3          4          5
## 0.09705221 0.91166115 0.60180274 0.91166115 0.09705221
##          6
## 0.09705221

# introduce a cut-off level above which we assume survival
predic <- ifelse(predicScore > 0.7, 1, 0)
head(predic)

## 1 2 3 4 5 6
## 0 1 0 1 0 0

A first and very simple approach is to calculate the confusion matrix. This confusion matrix shows the correct classifications and the misclassifications.

# the confusion matrix
confusion_matrix <- table(predic, d$Survived)
rownames(confusion_matrix) <- c("predicted_death",
                                "predicted_survival")
colnames(confusion_matrix) <- c("observed_death",
                                "observed_survival")
confusion_matrix

##
## predic               observed_death observed_survival
##   predicted_death               540               181
##   predicted_survival              9               161


# as a percentage:
confusion_matrixPerc <- sweep(confusion_matrix, 2,
                              margin.table(confusion_matrix, 2), "/")
round(confusion_matrixPerc, 2)

##
## predic               observed_death observed_survival
##   predicted_death              0.98              0.53
##   predicted_survival           0.02              0.47

So far we used the standard functions of R. The package ROCR provides the additional functionality to visualize the ROC-curve.

library(ROCR)

## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess

pred <- prediction(predict(m1,type="response"), d$Survived)

# visualize the ROC curve
plot(performance(pred, "tpr", "fpr"), col="blue", lwd=3)
abline(0, 1, lty=2)

Plotting the accuracy of the model (as a function of the cut-off) is just as easy.

plot(performance(pred, "acc"), col="blue", lwd=3)

Measures like the AUC, Gini and KS are now also readily available.

# AUC
AUC <- attr(performance(pred, "auc"), "y.values")[[1]]
AUC

## [1] 0.8328354

# GINI
paste("the Gini is:", 2 * AUC - 1)

## [1] "the Gini is: 0.665670703778268"

# KS
# using the function ks.test() from the package {stats}
ks.test(attr(pred, "predictions")[[1]], d$Survived,
        alternative = 'greater')



Figure 4.8: ROC curve


Figure 4.9: The accuracy, as a function of the cut-off, for the logistic regression model on the Titanic data.


## Warning in ks.test(attr(pred, "predictions")[[1]], d$Survived,
## alternative = "greater"): p-value will be approximate in the
## presence of ties

##
## Two-sample Kolmogorov-Smirnov test
##
## data: attr(pred, "predictions")[[1]] and d$Survived
## D^+ = 0.38384, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies above that of y

# This standard test does not work well in our example.
# Here is an alternative:
perf <- performance(pred, "tpr", "fpr")
ks <- max(attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]])
ks

## [1] 0.5337456

# visualize the KS in the plot
plot(perf, main=paste0(' KS=', round(ks*100, 1), '%'),
     lwd=3, col='red')
lines(x = c(0,1), y = c(0,1), col='blue')


(Plot: the ROC curve in red with the identity line in blue; the plot title reads “KS=53.4%”, with the false positive rate on the x-axis and the true positive rate on the y-axis.)


4.3 Analysis of variance

In general, analysis of variance (aov or anova) rejects or accepts the null hypothesis that all groups are random samples from the same population.

We use regression analysis to create models which describe the effect of variation in predictor variables on the response variable. Sometimes, if we have a categorical variable with values like Yes/No, Male/Female, grading, etc., the simple regression analysis gives multiple results for each value of the categorical variable. In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable. Such an analysis is termed Analysis of Covariance, also called ANCOVA.

Consider the following example, based on the built-in data set mtcars. In it we observe that the field “am” represents the type of transmission (auto or manual). It is a categorical variable with values 0 and 1. The miles per gallon value (mpg) of a car can also depend on it, besides the value of horse power (“hp”).

We study the effect of the value of “am” on the regression between “mpg” and “hp”. It is done by using the aov() function followed by the anova() function to compare the multiple regressions.

The intuition that the gearbox of a car has an independent effect on the fuel consumption can be illustrated by plotting the histograms for cars with automatic and cars with manual gearboxes. This is represented in Figure 4.10 and it can be obtained by the following code.

h_m <- hist(mtcars[mtcars$am == 0, 1],
            breaks=c(10, 15, 20, 25, 30, 35))
h_a <- hist(mtcars[mtcars$am == 1, 1])
m_col <- rgb(0, 0, 1, 0.2)
a_col <- rgb(1, 0, 1, 0.2)
x_min <- min(mtcars$mpg) - 1
x_max <- max(mtcars$mpg) + 1
plot(h_m, col=m_col, xlim=c(x_min, x_max), xlab="MPG",
     main="Histogram of mtcars")
plot(h_a, col=a_col, xlim=c(x_min, x_max), add=TRUE)
legend('topright', c('manual', 'automatic'),
       fill=c(m_col, a_col), bty='n', border=NA)

The null hypothesis is tested by the F-test, so that

F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{\text{variance between groups}}{\text{variance within groups}}

This allows us in practice to determine which variables have to be retained for a linear model.



Figure 4.10: The histogram of fuel consumption for cars with different gearboxes. Note that in the area where both plots overlap, the colour is the sum of both colours.

ANOVA Analysis
We create a regression model taking “hp” as the predictor variable and “mpg” as the response variable, taking into account the interaction between “am” and “hp”: a model with an interaction between the categorical variable and the predictor variable.

# Get the dataset.
input <- mtcars

# Create the regression model.
result <- aov(mpg ~ hp * am, data = input)
print(summary(result))

##             Df Sum Sq Mean Sq F value   Pr(>F)
## hp           1  678.4   678.4  77.391 1.50e-09 ***
## am           1  202.2   202.2  23.072 4.75e-05 ***
## hp:am        1    0.0     0.0   0.001    0.981
## Residuals   28  245.4     8.8
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than


0.05. But the interaction between these two variables is not significant, as the p-value is more than 0.05. So we can consider a model without this insignificant interaction.

A more robust model

# Create the regression model.
result <- aov(mpg ~ hp + am, data = mtcars)
print(summary(result))

##             Df Sum Sq Mean Sq F value   Pr(>F)
## hp           1  678.4   678.4   80.15 7.63e-10 ***
## am           1  202.2   202.2   23.89 3.46e-05 ***
## Residuals   29  245.4     8.5
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.

Comparing Two Models

anova
Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant. For this we use the anova() function.

# Create the regression models.
result1 <- aov(mpg ~ hp * am, data = mtcars)
result2 <- aov(mpg ~ hp + am, data = mtcars)

# Compare the two models.
anova(result1, result2)

## Analysis of Variance Table
##
## Model 1: mpg ~ hp * am
## Model 2: mpg ~ hp + am
##   Res.Df    RSS Df  Sum of Sq     F Pr(>F)
## 1     28 245.43
## 2     29 245.44 -1 -0.0052515 6e-04 0.9806

As the p-value is greater than 0.05, we conclude that the interaction between horse power and transmission type is not significant. So the mileage per gallon will depend in a similar manner on the horse power of the car in both automatic and manual transmission mode.


Note: the last numbers of the output refer to the models that are given to the anova() function. This means that the probability given for each model is the probability that it is better than the first one. To see this, try to add a third one.


4.4 Learning Machines

Forms of Learning

• supervised learning: the algorithm will learn from provided results (e.g. we have data of good and bad credit customers)

• unsupervised learning: the algorithm groups observations according to a given criterion (e.g. the algorithm classifies customers according to profitability, without being told what good or bad is)

• reinforced learning: the algorithm learns from outcomes: rather than being told what is good or bad, the system will get something like a cost function (e.g. the result of a treatment, the result of a chess game or the relative return of a portfolio in a competitive stock market). Another way of defining reinforced learning is that in this case the environment rather than the teacher provides the right outcomes.

If feedback is available, either in the form of correct outcomes (teacher) or deducible from the environment, then the task is to fit a function based on example input and output. We distinguish two forms of inductive learning:

• learning a discrete function is called classification

• learning a continuous function is called regression

These problems are equivalent to the problem of trying to approximate an unknown function f(). Since we don't know f(), we use a hypothesis h() that will approximate f(). A first and powerful class of approximations are linear functions and hence linear regressions.¹

In the linear case we can fit just one line perfectly to two data points. However, if there are more observations, then the line will in general not go through all observations. In other words, given the hypothesis space H_L of all linear functions, the problem is realizable if all observations are collinear. Otherwise it is not realizable and we have to try to find a “good fit”.

Of course, we could choose our hypothesis space H as large as the class of all Turing machines. However, this will still not be sufficient in case of conflicting observations. Therefore we need a method that finds a hypothesis that is optimal in a certain sense. The OLS method for linear regression is one such example: the line fits all observations optimally, in the sense that the sum of the squared distances between each point and the fitted line is minimal.

¹ Indeed, we argue here that linear regressions are a form of machine learning.


4.5 Decision Tree

A decision tree is another example of inductive machine learning. It is one of the most intuitive methods, and in fact many heuristics, from bird determination to medical treatment, are often summarized in decision trees when humans have to learn. The way of thinking that follows from a decision tree comes naturally: if I'm hungry, then I check if I have cash on me; if so, then I buy a sandwich.

4.5.1 Essential Background

The linear additive decision tree
A decision tree splits the feature space in rectangles and then fits a very simple model, such as a constant, in each rectangle:

y = f(x) = \sum_{n=1}^{N} \alpha_n I\{x \in R_n\}

with x = (x_1, \ldots, x_m) and I\{b\} the indicator function, so that

I\{b\} := \begin{cases} 1 & \text{if } b \\ 0 & \text{if } !b \end{cases}


Figure 4.11: An example of a decision tree on fake data, represented in two ways: on the left the decision tree, and on the right the regions R_i that can be identified in the (x_1, x_2)-plane.

Note: From Figure 4.13 it will be clear that only certain types of partitionings can be obtained. For example, overlapping rectangles will never be possible with a decision tree, nor will round shapes be obtained.
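The additive form above is easy to mimic by hand. A toy sketch for illustration only (the split points and the constants α are invented, not fitted):

# invented tree: split on x1, then on x2 in the left branch
predict_tree <- function(x1, x2) {
  ifelse(x1 < 0.5,
         ifelse(x2 < 0.5, 10, 20),  # alpha_1 and alpha_2
         30)                        # alpha_3
}
predict_tree(c(0.2, 0.2, 0.9), c(0.1, 0.8, 0.5))  # 10 20 30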

The CART method
In order to fit the tree (or regression), we need to identify a measure of “how good the fit is”. If we adopt the popular sum


of squares criterion, \sum_i (y_i - f(x_i))^2, then it can be shown that the average of the values of the observations y_i in each region R_i should be the estimate:

\hat{y}_i = \text{avg}(y_i | x_i \in R_i)

Unfortunately, finding the best split of the tree for the minimum sum of squares criterion is in general not practical. One way to solve this problem is working with a “greedy algorithm”. Start with all data and consider a splitting variable j (in the previous example we had only 2 variables, x_1 and x_2, so j would be 1 or 2) and a splitting value x_j^s, so that we can define the first region R_1 as

R_1(j, x_j^s) := \{x \mid x_j < x_j^s\}

and by doing so we, of course, also obtain a second region R_2:

R_2(j, x_j^s) := \{x \mid x \notin R_1\} = \{x \mid x_j \geq x_j^s\}

The best splitting variable is then the one that minimizes the sum of squares between estimates and observations:

\min_{j,s}\left[\min_{\hat{y}_1} \sum_{x_i \in R_1(j,s)} (y_i - \hat{y}_1)^2 + \min_{\hat{y}_2} \sum_{x_i \in R_2(j,s)} (y_i - \hat{y}_2)^2\right] \qquad (4.1)

For any pair (j, x_j^s) we can solve the minimizations with the previously discussed average as estimator:

\hat{y}_1 = \text{avg}\left[y_i | x_i \in R_1(j, x_j^s)\right]
\hat{y}_2 = \text{avg}\left[y_i | x_i \in R_2(j, x_j^s)\right]

This is computationally quite fast to solve, and therefore it is feasible to determine the split-point x_j^s for each attribute x_j and hence completely solve Equation 4.1. This in its turn solves the optimal split for each pair (j, x_j^s), so that the optimal split variable (attribute) can be determined. This is then the first node of the tree. This procedure leads to a partition of the original data in two sets. Each set can be split once more with the same approach. This procedure can then be repeated on each sub-region.
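A hedged sketch of this greedy search for one split point on a single attribute (the function best_split() is our own illustration, not a package API):

# best split of y on one numeric attribute x, minimizing the
# two-region sum of squares of Equation 4.1
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  sse <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}
best_split(mtcars$wt, mtcars$mpg)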

Probably it comes naturally to the reader to ponder at this point how large and complex we should grow the tree, and it probably also comes naturally to think that stopping for a given minimal improvement in least squares makes sense. This would work if all variables x_i were independently distributed. However, in reality it is possible to see complex interactions. Let's consider the imaginary situation where credit risk for males improves with age, but deteriorates with age for women, while on average there is little difference in creditworthiness between the average man and

Philippe De Brouwer — 132 —

Page 133: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

4.5. DECISION TREE

average woman. In that case splitting the population on the attribute “sex”does not yield much improvement overall. However, once we only see oneof the sexes the decreasing or increasing trend can yield a significant gain insum of squares. It might also be clear that first splitting on “age” would notwork either if both men and women are equally represented, because thenthe trends would cancel each other out. Once the split in age group is done,the split according to “sex” will in the next node dramatically improve thesum of squares).
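The following simulation (hypothetical data, invented here purely for illustration) shows this effect: the risk trends in age with opposite signs for the two sexes, so neither variable helps much on its own, while age within sex is very informative.

# hypothetical data: opposite age trends for the two sexes
set.seed(1890)
n    <- 1000
sex  <- rep(c("M", "F"), each = n / 2)
age  <- runif(n, 20, 70)
risk <- ifelse(sex == "M", -0.02, 0.02) * (age - 45) + rnorm(n, sd = 0.1)
var(risk)                              # total variance ...
mean(tapply(risk, sex, var))           # ... barely reduced by splitting on sex
summary(lm(risk ~ age))$r.squared      # age alone explains almost nothing
summary(lm(risk ~ age:sex))$r.squared  # age within sex explains a lot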

Tree Pruning  So, the only solution to the problem of optimal tree size is to allow the tree to grow to a given size that is too large and overfits the data, and only then reduce the size. This reduction of size is called "pruning".

The idea is to minimize the "cost of complexity function" for a given pruning parameter α. The cost function is defined as

C_\alpha(T) := \sum_{n=1}^{|E_T|} SE_n(T) + \alpha |T|    (4.2)

This is the sum of squares in each end-node plus α times the size of the tree. |T| is the number of terminal nodes in the sub-tree T (T is a subtree of T_0 if T has only nodes of T_0), |E_T| is the number of end-nodes in the tree T, and SE_n(T) is the sum of squares in the end-node n for the tree T. The squared error in node n (or in region R_n) also equals:

SE_n(T) = N_n \, MSE_n(T) = N_n \frac{1}{N_n} \sum_{x_i \in R_n} (y_i - \bar{y}_n)^2 = \sum_{x_i \in R_n} (y_i - \bar{y}_n)^2

with \bar{y}_n the average of all y_i in the region R_n, as explained before.

It appears that for each α there is a unique smallest sub-tree T_\alpha that minimises the cost function C_\alpha(T). A good way to find T_\alpha is to use "weakest link pruning". This process works as follows: collapse the internal node that produces the smallest increase in SE_n(T), and continue this till we have a single-node tree. This produces a finite series of trees, and it can be shown that this series includes T_\alpha.

This process leaves the pruning parameter α up to us. Larger values of α lead to smaller trees; with α = 0 one will get the full tree.


Classification Trees  In case the values y_i do not come from a numerical function but are rather on a nominal or ordinal scale², it is no longer possible to use the MSE as a measure of fitness for the model. In that case we can use the proportion of observations in node n that match class c:

p_{n,c} := \frac{1}{N_n} \sum_{x_i \in R_n} I\{y_i = c\}    (4.3)

The class that we will assign in node n is the class c with the highest proportion p_{n,c}, i.e. \hat{c} = \arg\max_c p_{n,c}. The node impurity can then be calculated by one of the following:

\text{Gini index} = \sum_{c \neq c'} p_{n,c} \, p_{n,c'} = \sum_{c=1}^{C} p_{n,c} (1 - p_{n,c})    (4.4)

\text{cross-entropy or deviance} = - \sum_{c=1}^{C} p_{n,c} \log_2(p_{n,c})    (4.5)

\text{misclassification error} = \frac{1}{N_n} \sum_{x_i \in R_n} I\{y_i \neq \hat{c}\} = 1 - p_{n,\hat{c}}    (4.6)

with C the total number of classes.

With R we can plot the different measures of impurity. Note that the entropy-based measure has a maximum of 1, while the others are limited to 0.5, so for the representation it has been divided by 2 in the plot.

## draw gini, deviance and misclassification functions
##
gini <- function(x) 2 * x * (1 - x)
entr <- function(x) (-x * log(x) - (1 - x) * log(1 - x)) / log(2) / 2
misc <- function(x) {1 - pmax(x, 1 - x)}
curve(gini, 0, 1, ylim = c(0, 0.5), col = "forestgreen", lwd = 3,
      xlab = "p", ylab = "impurity measure", type = "l")
curve(entr, 0, 1, add = TRUE, lwd = 3, col = "black")
curve(misc, 0, 1, add = TRUE, lwd = 3, col = "blue", type = "l", lty = 2)
text(0.85, 0.4,   "gini index",                 col = "forestgreen")
text(0.85, 0.485, "deviance or cross-entropy",  col = "black")
text(0.5,  0.3,   "misclassification index",    col = "blue")

The plot produced by this code is in Figure 4.12 on page 135. The behaviour of the misclassification error around p = 0.5 is not differentiable and might lead to abrupt changes in the tree for small differences in data (maybe just one point that is left out).

²For more details about ordinal and nominal scales we refer to Chapter 8 on page 217.


Figure 4.12: Three alternatives for the impurity measure in the case of classification problems (the impurity measure as a function of p, showing the gini index, the deviance or cross-entropy, and the misclassification index).

Because of that property of differentiability, the cross-entropy and the Gini index are better suited for numerical optimisation. It also appears that both the Gini index and the deviance are more sensitive to changes in the probabilities in the nodes: for example, it is possible to find cases where the misclassification error would not prefer a pure node.

Note also that the Gini index can be interpreted in some other useful ways. For example, if we code the observations of class c as 1 and all others as 0, then the variance over the node of this 0-1 response variable is p_{n,c}(1 - p_{n,c}); summing over all classes c we find the Gini index.

Also, if one would not assign all observations in the node to class \hat{c}, but rather assign them to class c with a probability equal to p_{n,c}, then the Gini coefficient is the training error rate in node n: \sum_{c \neq c'} p_{n,c} \, p_{n,c'}.

For pruning all three measures can be trusted; usually the misclassification error is used.

The procedure described in this chapter is known as the CART method (Classification and Regression Tree).

Binary classification  While largely covered by the explanation above, it is worth taking a few minutes to study the particular case where the output variable is binary: true or false, good or bad, 0 or 1. This is not only a very important case, but it also allows us to make the parallel with information theory.

Binary classifications are important cases in everyday practice: good or bad credit risk, sick or not, dead or alive, etc.

The mechanism to fit the tree works exactly the same. From all attributes, choose the one that classifies the results best, and split the data-set according to the value that best separates the goods from the bads.

Now we need a way to tell what is a good split. This can be done by selecting the attribute that has the most information value. The information –measured in bits– of outcomes x_i with probabilities P_i is

I(P_1, \ldots, P_N) = - \sum_{i=1}^{N} P_i \log_2(P_i)

which in the case of two possible outcomes (G the number of "good" observations and B the number of "bad" observations) reduces to

I\left(\frac{G}{G+B}, \frac{B}{G+B}\right) = - \frac{G}{G+B} \log_2\left(\frac{G}{G+B}\right) - \frac{B}{G+B} \log_2\left(\frac{B}{G+B}\right)

The amount of information provided by each attribute can be found by estimating the amount of information that is still to be gathered after the attribute is applied. Assume an attribute A (for example the age of the potential creditor), and assume that we use 4 bins for the attribute age. In that case the attribute age will separate our set of observations in four sub-sets: S_1, S_2, S_3, S_4. Each subset S_i then has G_i good observations and B_i bad observations. This means that if the attribute A is applied, we still need I\left(\frac{G_i}{G_i+B_i}, \frac{B_i}{G_i+B_i}\right) bits of information to classify all observations correctly.

A random sample from our data-set will be in the i-th age group with a probability P_i := \frac{B_i + G_i}{G + B}, since B_i + G_i equals the number of observations in the i-th age group. Hence the remaining information after applying the attribute age is:

\mathrm{remainder}(A) = \sum_{i=1}^{4} P_i \, I\left(\frac{G_i}{G_i+B_i}, \frac{B_i}{G_i+B_i}\right)

This means that the information gain of applying the attribute A is

\mathrm{gain}(A) = I\left(\frac{G}{G+B}, \frac{B}{G+B}\right) - \mathrm{remainder}(A)

So, it is sufficient to calculate the information gain for each attribute and split the population according to the attribute that has the highest information gain.
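A small helper (our own sketch, not from the book) makes this concrete: it computes I(...) from the empirical proportions and derives the gain of a candidate attribute.

# sketch: entropy and information gain from empirical proportions
info <- function(p) {
  p <- p[p > 0]                 # avoid log2(0)
  -sum(p * log2(p))
}
gain <- function(attribute, outcome) {
  base <- info(table(outcome) / length(outcome))
  rem  <- sum(sapply(split(outcome, attribute), function(s) {
    length(s) / length(outcome) * info(table(s) / length(s))
  }))
  base - rem
}
# e.g. on the Titanic data loaded in Section 4.5.3:
# gain(t$sex, t$survived)        # a strong predictor
# gain(t$embarked, t$survived)   # much weaker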


4.5.2 Important considerations

Decision trees have many advantages: they are intuitive and can be represented in a 2D graph. The nodes in the graph represent an event or choice and the edges of the graph represent the decision rules or conditions. This makes the tool very accessible and an ideal method if a transparent model is needed.

Broadening the Scope  Decision trees raise some particular possibilities; we briefly discuss some of them below.

1. Loss Matrix: in reality a misclassification in one or the other group does not come at the same cost. If we wrongly identify a tumour as cancer, the health insurance will lose some money on an unnecessary treatment; but if we misclassify the tumour as harmless, then the patient will die, and that is an order of magnitude worse. A bank, for example, might wrongly reject a good customer and fail to earn an income of 1'000 on a loan of 10'000; however, if the bank accepts the wrong customer, then it can lose 10'000 or more in recovery costs. This can be mitigated with a loss matrix. Define a C × C loss matrix L, with L_{kl} the loss incurred by misclassifying a class k as a class l. As a reference one usually takes a correct classification as a zero cost: L_{kk} = 0. Now it is sufficient to modify the Gini index to \sum_{k \neq l}^{C} L_{kl} \, p_{nk} \, p_{nl}. This works fine when C > 2; but unfortunately, due to the symmetry of a two-class problem, this has no effect there, since the coefficient of p_{nk} p_{nl} is (L_{kl} + L_{lk}). The workaround is weighting the observations of class k by L_{kl}.

2. Missing Values: missing values are a problem for all statistical methods. The classical approaches of leaving such observations out or trying to fill them in via some model have serious issues if the fact that data is missing has a specific reason. For example, males might not fill in the "sex" information if they know that it will increase the price of their car insurance; in that particular case "no gender information" can be a worse insurance risk than "male", if only the males that already had issues learned this trick. Decision trees allow for another method: assign a specific value "missing" to that predictor. Alternatively one can use surrogate splits: first work only with the data that has no missing fields, and then try to find alternative variables that provide the same split.

3. Linear combination splits: in our approach each node has a simple decision model of the form x_i \leq x_j^s. One can consider decisions of the form \sum_i \alpha_i x_i \leq x_j^s. Depending on the particular case this might considerably improve the predictability of the model; however, it will be more difficult to understand what happens. In the ideal case these nodes would lead to some clear attributes such as "risk seeking behaviour" (where the model might create this concept out of a combination of "male", "age group 1" and "marital status unmarried") or affordability to pay ("marital status married", "has a job longer than 2 years"). In general this will hardly happen, and it becomes unclear what exactly is going on and why the model is refusing the loan. This in its turn makes it more difficult to assess whether a model is over-fit or not.

4. Link with ANOVA: an alternative way to understand the ideal stopping point is using the ANOVA approach. The impurity in a node can be thought of as the MSE in that node:

MSE = \sum_{i=1}^{n} (y_i - \bar{y})^2

with y_i the value of the i-th observation and \bar{y} the average of all observations.

This node impurity can also be thought of as in an ANOVA analysis:

\frac{SS_{between} / (B-1)}{SS_{within} / (n-B)} \sim F_{B-1,\, n-B}

with

SS_{between} = \sum_{b=1}^{B} n_b (\bar{y}_b - \bar{y})^2
SS_{within} = \sum_{b=1}^{B} \sum_{i=1}^{n_b} (y_{bi} - \bar{y}_b)^2

with B the number of branches, n_b the number of observations in branch b, and y_{bi} the value of observation i in branch b.

Now optimal stopping can be determined by using measures of fit and relevance as in a linear regression model. For example one can rely on R², MAD, etc.

5. Other Tree Building Procedures: the method described so far is known as CART (classification and regression tree). Other popular choices are ID3 (and its successors C4.5 and C5.0) and MARS.

Issues

1. Over-fitting: this is one of the most important issues with decision trees. A decision tree should never be used without appropriate validation methods, such as cross-validation or a random forest approach, before an effort to prune the tree. See e.g. Hastie et al. (2009).


2. Categorical predictor values: categorical variables that represent a nominal or ordinal scale present a specific challenge. If the variable has κ possible values, then the number of partitions that can be made is 2^{κ-1} - 1. With the number of partitions growing exponentially in κ, the algorithms tend to favour categorical predictors with many levels, which will lead to severe over-fitting.

3. Instability: small changes in data can lead to dramatically different tree structures, because even a small change at the top of the tree will be cascaded down the tree. This works very differently in, for example, linear regression models, where one additional data-point (unless it is an outlier) will only have a small influence on the parameters of the model. Methods such as Random Forest alleviate this instability, but bagging of data will also improve the stability.

4. Difficulties to capture Additive Relationships: a decision tree will naturally fit decisions that are not additive. For example, "if a person has the affordability to pay and he is honest, then he will pay the loan back" would work fine in a decision tree. If, however, the fact of the customer paying back the loan depends on many other factors that all have to be in place and can mitigate each other, then an additive relationship might work better.³

5. Stepwise Predictions: the outcome of a decision tree will naturally be a step-function. Modelling a linear relationship, for example, is not efficient, as it will be approximated by a number of discrete steps. The MARS procedure allows to alleviate this to some extent.

4.5.3 Growing trees with R

We will use the full data-set of the Titanic data:

t <- read.csv("data/titanic3.csv")

The dataset has the following definitions:

³An additional logic to predict payment capacity would be, for example: having a stable job, having a diploma, having a spouse that can step in, having savings, etc. Indeed, all those things can be added up, and if the spouse is gone the diploma will help to find a next job: these variables compensate each other. This is not the case for over-indebtedness, fraud (no intention to pay), an unstable job, etc.: in those cases there is no compensation, and one of those things going wrong will result in a customer that does not pay his loan back.


# survived : Survival (0 = No; 1 = Yes)
# pclass   : Class of travel (1 = 1st; 2 = 2nd; 3 = 3rd)
# name     : passenger name
# sex      : gender
# age      : age
# sibsp    : number of siblings/spouses aboard
# parch    : number of parents/children aboard
# ticket   : ticket number
# fare     : fare paid
# cabin    : cabin number
# embarked : port of embarkation (C = Cherbourg;
#            Q = Queenstown; S = Southampton)
# boat     : lifeboat reference (if survived)
# body     : body number (if recovered)
# home.dest: home destination

R does an excellent job in taking much of the complexity away, and there exist packages that provide tools to grow and optimize trees as well as to visualize and report the results.

There are a few packages in R that provide excellent results. One of themost popular ones is the package tree.

It is very straightforward to use:

Example for the tree() function

require(tree)

## Loading required package: tree

frm <- survived ~ pclass + sex + sibsp + parch + embarked + home.dest
tr <- tree(frm, data=t)

## Error in tree(frm, data = t): factor predictors must have at most 32 levels

# note: this fails because home.dest has too many levels
# so, let's simplify
frm <- survived ~ pclass + sex + sibsp + parch + embarked
tr <- tree(frm, data=t)
summary(tr)

## Regression tree:
## tree(formula = frm, data = t)
## Variables actually used in tree construction:
## [1] "sex"    "pclass"
## Number of terminal nodes:  4
## Residual mean deviance:  0.1494 = 195 / 1305


## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.9320 -0.1506 -0.1506  0.0000  0.0680  0.8494

Figure 4.13: A decision tree on the data of the Titanic disaster.

The tree can be visualized with the following simple commands:

plot(tr); text(tr)

It is obvious that the more nodes (branches) we use in our tree, the better our predictions will be. Of course, that is only true on the data-set that we have in front of us, and in reality that is usually of little importance. For example, the data are the historic defaults of customers in the past, but what really matters is to infer future defaults that are not known at this point in time. This means that we should rather look for a model that is robust than for a perfect fit, because a perfect fit would bring the vulnerabilities of an over-fit model. In other words, we look for patterns that really matter, patterns that we expect to continue to be true, and it matters less to have the best possible predictions on our existing data-set.⁴

⁴It is important to understand what one tries to do: statistics is a tool just like a hammer (it works fine on nails, but not so well on screws). Of course descriptive statistics is useful to understand phenomena; for example, one can investigate the survivor data of the Titanic disaster to understand better what happened. However, an insurer should be interested in the power to predict the future based on the data that it has today.
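A minimal way to guard against over-fitting (a sketch we add here; the seed and the 75/25 split are arbitrary choices) is to fit the tree on a training subset and measure the error on the held-out passengers:

# sketch: hold-out validation of the tree above
set.seed(1912)   # arbitrary seed
idx <- sample(1:nrow(t), round(0.75 * nrow(t)))
tr2 <- tree(frm, data = t[idx, ])
# out-of-sample mean squared error:
mean((predict(tr2, newdata = t[-idx, ]) - t[-idx, "survived"])^2)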


Related functions

• prune.tree(): find the series of optimal trees where in each step the least important nodes are collapsed

• tree.control(): to be used within the tree function to control the size of the desired tree

• predict.tree(): returns a vector of predictions from a tree object

• snip.tree(): a function to manually edit a given tree (snip off some nodes)

A more powerful alternative with rpart  Also the function rpart() of the rpart library provides a good implementation of the CART algorithm.

1. load the package rpart

2. fit the tree with t <- rpart(formula, data, weights, subset, na.action = na.rpart, method = c("class", "anova"), model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) and optionally control the size of the tree with control = rpart.control(minsplit = 20, cp = 0.01). This example sets the minimum number of observations in a node to 20 and requires that a split must decrease the overall impurity of the fit by a factor of 0.01 (cost complexity factor) before being attempted.

3. investigate and visualize the results with

• printcp(t) to visualize the cp (complexity parameter) table⁵

• plotcp(t) to plot the cross-validation results

• rsq.rpart(t) to plot the R-squared and relative error for different splits (2 plots); note that the labels are only appropriate for the "anova" method

• print(t) to print the results

• summary(t): to display the results (with surrogate splits)

• plot(t): to plot the decision tree

• text(t): to add labels to the plot

⁵The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue. We could also say that tree construction does not continue unless it would decrease the overall lack of fit by a factor of cp. The complexity parameter as implemented in the package rpart is not the same as in Equation 4.2 but follows from a similar build-up: C_{cp}(T) := C(T) + cp \, |T| \, C(T_0).


• post(t, file = "") to create a postscript file of the plot

More information about the rpart package can be found in the documentation of the project: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.

Example of a regression tree with rpart

## example of a regression tree with rpart on the dataset mtcars
##
library(rpart); library(MASS)
t <- rpart(mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear,
  data = mtcars, na.action = na.rpart, method = "anova",
  control = rpart.control(
    minsplit  = 10,     # minimum number of observations required for a split
    minbucket = 20/3,   # minimum number of observations in a terminal node
                        # (the default is minsplit/3)
    cp = 0.01,          # complexity parameter set to a very small value:
                        # this will grow a large (over-fit) tree
    maxcompete = 4,     # number of competitor splits retained in the output
    maxsurrogate = 5,   # number of surrogate splits retained in the output
    usesurrogate = 2,   # how to use surrogates in the splitting process
    xval = 7,           # number of cross-validations
    surrogatestyle = 0, # controls the selection of a best surrogate
    maxdepth = 30))     # maximum depth of any node of the final tree

printcp(t)

## Regression tree:
## rpart(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + am +
##     gear, data = mtcars, na.action = na.rpart, method = "anova",
##     control = rpart.control(minsplit = 10, minbucket = 20/3,
##         cp = 0.01, maxcompete = 4, maxsurrogate = 5, usesurrogate = 2,
##         xval = 7, surrogatestyle = 0, maxdepth = 30))
##
## Variables actually used in tree construction:
## [1] cyl  disp hp   wt
##
## Root node error: 1126/32 = 35.189
##
## n= 32
##
##         CP nsplit rel error  xerror    xstd
## 1 0.652661      0   1.00000 1.06513 0.25148
## 2 0.194702      1   0.34734 0.70272 0.16889
## 3 0.035330      2   0.15264 0.53109 0.11867
## 4 0.014713      3   0.11731 0.44271 0.11762
## 5 0.010000      4   0.10259 0.44271 0.11762

plotcp(t)

Figure 4.14: The plot of the complexity parameter (cp) via the function plotcp().

print(t)

## n= 32
##

## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 32 1126.04700 20.09062
##    2) wt>=2.26 26  346.56650 17.78846
##      4) cyl>=7 14   85.20000 15.10000
##        8) hp>=192.5 7  28.82857 13.41429 *
##        9) hp< 192.5 7  16.58857 16.78571 *
##      5) cyl< 7 12   42.12250 20.92500
##       10) disp>=153.35 6  12.67500 19.75000 *
##       11) disp< 153.35 6  12.88000 22.10000 *
##    3) wt< 2.26 6   44.55333 30.06667 *

summary(t)

## Call:
## rpart(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + am +
##     gear, data = mtcars, na.action = na.rpart, method = "anova",
##     control = rpart.control(minsplit = 10, minbucket = 20/3,
##         cp = 0.01, maxcompete = 4, maxsurrogate = 5, usesurrogate = 2,
##         xval = 7, surrogatestyle = 0, maxdepth = 30))
##   n= 32
##
##           CP nsplit rel error    xerror      xstd
## 1 0.65266121      0 1.0000000 1.0651257 0.2514760
## 2 0.19470235      1 0.3473388 0.7027190 0.1688902
## 3 0.03532965      2 0.1526364 0.5310894 0.1186716
## 4 0.01471297      3 0.1173068 0.4427135 0.1176233
## 5 0.01000000      4 0.1025938 0.4427135 0.1176233
##
## Variable importance
##   wt disp   hp drat  cyl qsec
##   25   24   20   15   10    5
##
## Node number 1: 32 observations,    complexity param=0.6526612
##   mean=20.09062, MSE=35.18897
##   left son=2 (26 obs) right son=3 (6 obs)
##   Primary splits:
##       wt   < 2.26   to the right, improve=0.6526612, (0 missing)
##       cyl  < 5      to the right, improve=0.6431252, (0 missing)
##       disp < 163.8  to the right, improve=0.6130502, (0 missing)
##       hp   < 118    to the right, improve=0.6010712, (0 missing)
##       drat < 3.75   to the left,  improve=0.4186711, (0 missing)
##   Surrogate splits:
##       disp < 101.55 to the right, agree=0.969, adj=0.833, (0 split)
##       hp   < 92     to the right, agree=0.938, adj=0.667, (0 split)
##       drat < 4      to the left,  agree=0.906, adj=0.500, (0 split)
##       cyl  < 5      to the right, agree=0.844, adj=0.167, (0 split)
##
## Node number 2: 26 observations,    complexity param=0.1947024
##   mean=17.78846, MSE=13.32948
##   left son=4 (14 obs) right son=5 (12 obs)
##   Primary splits:
##       cyl  < 7      to the right, improve=0.6326174, (0 missing)
##       disp < 266.9  to the right, improve=0.6326174, (0 missing)
##       hp   < 136.5  to the right, improve=0.5803554, (0 missing)
##       wt   < 3.325  to the right, improve=0.5393370, (0 missing)
##       qsec < 18.15  to the left,  improve=0.4210605, (0 missing)
##   Surrogate splits:
##       disp < 266.9  to the right, agree=1.000, adj=1.000, (0 split)
##       hp   < 136.5  to the right, agree=0.962, adj=0.917, (0 split)
##       wt   < 3.49   to the right, agree=0.885, adj=0.750, (0 split)
##       qsec < 18.15  to the left,  agree=0.885, adj=0.750, (0 split)
##       drat < 3.58   to the left,  agree=0.846, adj=0.667, (0 split)
##
## Node number 3: 6 observations
##   mean=30.06667, MSE=7.425556
##
## Node number 4: 14 observations,    complexity param=0.03532965
##   mean=15.1, MSE=6.085714
##   left son=8 (7 obs) right son=9 (7 obs)
##   Primary splits:
##       hp   < 192.5  to the right, improve=0.46693490, (0 missing)
##       wt   < 3.81   to the right, improve=0.13159230, (0 missing)
##       qsec < 17.35  to the right, improve=0.13159230, (0 missing)
##       drat < 3.075  to the left,  improve=0.09982394, (0 missing)
##       disp < 334    to the right, improve=0.05477308, (0 missing)
##   Surrogate splits:
##       drat < 3.18   to the right, agree=0.857, adj=0.714, (0 split)
##       disp < 334    to the right, agree=0.786, adj=0.571, (0 split)
##       qsec < 16.355 to the left,  agree=0.786, adj=0.571, (0 split)
##       wt   < 4.66   to the right, agree=0.714, adj=0.429, (0 split)
##       am   < 0.5    to the right, agree=0.643, adj=0.286, (0 split)
##
## Node number 5: 12 observations,    complexity param=0.01471297
##   mean=20.925, MSE=3.510208
##   left son=10 (6 obs) right son=11 (6 obs)
##   Primary splits:
##       disp < 153.35 to the right, improve=0.393317100, (0 missing)
##       hp   < 109.5  to the right, improve=0.235048600, (0 missing)
##       drat < 3.875  to the right, improve=0.043701900, (0 missing)
##       wt   < 3.0125 to the right, improve=0.027083700, (0 missing)
##       qsec < 18.755 to the left,  improve=0.001602469, (0 missing)
##   Surrogate splits:
##       cyl  < 5      to the right, agree=0.917, adj=0.833, (0 split)
##       hp   < 101    to the right, agree=0.833, adj=0.667, (0 split)
##       wt   < 3.2025 to the right, agree=0.833, adj=0.667, (0 split)
##       drat < 3.35   to the left,  agree=0.667, adj=0.333, (0 split)
##       qsec < 18.45  to the left,  agree=0.667, adj=0.333, (0 split)
##
## Node number 8: 7 observations
##   mean=13.41429, MSE=4.118367
##
## Node number 9: 7 observations
##   mean=16.78571, MSE=2.369796
##
## Node number 10: 6 observations
##   mean=19.75, MSE=2.1125
##
## Node number 11: 6 observations
##   mean=22.1, MSE=2.146667

Figure 4.15: rpart tree on mpg for the dataset mtcars.

plot(t)
text(t)

# now prune the tree:
t1 <- prune(t, cp=0.1)
plot(t1); text(t1)

Figure 4.16: The same tree as in Figure 4.15 but now pruned with a complexity parameter of 0.1.

Example of a classification tree with rpart

In practice the data itself is usually biased. For example, a bank only gives loans to customers when they are expected to pay back the loan. This means that any dataset of existing customers will show an overwhelming amount of "good" customers compared to very few customers that defaulted on their loans. Variance-based methods such as a decision tree will therefore be heavily biased towards the reliable customers: the method will try to classify on average each customer as good or bad, and the small number of bad customers will never generate the same amount of variance.

Imagine that we have a database of 100'000 customers of which 1'000 defaulted on their loans. There are multiple ways of addressing this problem:

1. copy the existing bad customers 99 times

2. drop 98’000 good customers randomly

3. weight the observations.

The function rpart allows us to correct these prior probabilities via the parameter parms.

The death toll of the Titanic disaster was very high: only 500 of the 1309 passengers survived. This means that the prior probability of surviving is about 0.3819 = 500/1309. So, this particular dataset has a similar amount of good and bad results. In this case one might omit the part


parms=list(prior = c(0.6,0.4))

## example of a classification tree with rpart on the Titanic data
##
titanic <- read.csv("data/titanic3.csv")
frm <- survived ~ pclass + sex + sibsp + parch + embarked + age
t0 <- rpart(frm, data = titanic, na.action = na.rpart, method = "class",
  parms = list(prior = c(0.6, 0.4)),
  control = rpart.control(
    minsplit  = 50,     # minimum number of observations required for a split
    minbucket = 20,     # minimum number of observations in a terminal node
    cp = 0.001,         # complexity parameter set to a very small value:
                        # this will grow a large (over-fit) tree
    maxcompete = 4,     # number of competitor splits retained in the output
    maxsurrogate = 5,   # number of surrogate splits retained in the output
    usesurrogate = 2,   # how to use surrogates in the splitting process
    xval = 7,           # number of cross-validations
    surrogatestyle = 0, # controls the selection of a best surrogate
    maxdepth = 6))      # maximum depth of any node of the final tree

printcp(t0)

## Classification tree:
## rpart(formula = frm, data = titanic, na.action = na.rpart, method = "class",
##     parms = list(prior = c(0.6, 0.4)), control = rpart.control(minsplit = 50,
##         minbucket = 20, cp = 0.001, maxcompete = 4, maxsurrogate = 5,
##         usesurrogate = 2, xval = 7, surrogatestyle = 0, maxdepth = 6))
##
## Variables actually used in tree construction:
## [1] age      embarked pclass   sex      sibsp
##
## Root node error: 523.6/1309 = 0.4
##
## n= 1309
##
##          CP nsplit rel error  xerror     xstd
## 1 0.4425241      0   1.00000 1.00000 0.035158
## 2 0.0213115      1   0.55748 0.55748 0.029038
## 3 0.0092089      3   0.51485 0.52769 0.028762
## 4 0.0073337      4   0.50564 0.54077 0.028936
## 5 0.0010000      6   0.49098 0.53156 0.028644

plotcp(t0)

# print(t0) and summary(t0) are left out to avoid too long an output
plot(t0)
text(t0)

# now prune the tree:
t1 <- prune(t0, cp=0.01)
plot(t1); text(t1)


Figure 4.17: The plot of the complexity parameter (cp) via the function plotcp().

Figure 4.18: rpart tree predicting survival in the Titanic disaster.


Figure 4.19: The same tree as in Figure 4.18 but now pruned with a complexity parameter of 0.01.

Visualizing the tree with rpart.plot

When plotting the tree we used the standard methods that are supplied with the library rpart: plot.rpart and text.rpart. The package rpart.plot extends this with both useful information and visually pleasing effects. There are options specific to classification trees and others for regression trees.

# plot the tree with rpart.plot
library(rpart.plot)
prp(t0, type = 5, extra = 8, box.palette = "auto",
    #branch.type = 5,
    yesno = 1, yes.text = "survived", no.text = "dead")

The function prp() takes many arguments and allows the user to write functions to obtain exactly the desired result.


Figure 4.20: The decision tree represented by the function prp() from the package rpart.plot.

4.5.4 Evaluating the performance of a decision tree

The performance of the regression tree

MSE, MAD, etc. (TBA)
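As a stopgap for the placeholder above (a sketch we add here, assuming the rpart regression tree t on mtcars from Section 4.5.3 is still in memory), the two most common measures are computed as follows:

# sketch: error measures for the regression tree t on mtcars
pred <- predict(t, newdata = mtcars)
mean((mtcars$mpg - pred)^2)     # MSE: mean squared error
mean(abs(mtcars$mpg - pred))    # MAD: mean absolute deviation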

The performance of the classification tree

We continue the example of the Titanic disaster of the previous section. First we need to make some predictions.

predicPerc <- predict(object=t0, newdata=titanic)
# this is a matrix with probabilities
# (column 1 not to survive, column 2 survived)
head(predicPerc)

##            0         1
## 1 0.06335498 0.9366450
## 2 0.44928523 0.5507148
## 3 0.06335498 0.9366450
## 4 0.81814936 0.1818506
## 5 0.06335498 0.9366450
## 6 0.81814936 0.1818506


predic <- predict(object=t0, newdata=titanic, type="class")
# vector with only the fitted class as prediction
head(predic)

## 1 2 3 4 5 6
## 1 1 1 0 1 0
## Levels: 0 1

A first and very simple approach is to calculate the confusion matrix. This confusion matrix shows the correct classifications and the misclassifications.

# the confusion matrix
confusion_matrix <- table(predic, titanic$survived)
rownames(confusion_matrix) <- c("predicted_death", "predicted_survival")
colnames(confusion_matrix) <- c("observed_death", "observed_survival")
confusion_matrix

## predic               observed_death observed_survival
##   predicted_death               706               150
##   predicted_survival            103               350

# as a percentage:
confusion_matrixPerc <- sweep(confusion_matrix, 2,
    margin.table(confusion_matrix, 2), "/")
round(confusion_matrixPerc, 2)

## predic               observed_death observed_survival
##   predicted_death              0.87              0.30
##   predicted_survival           0.13              0.70
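From this matrix a few standard summary numbers follow directly (our own addition, not in the source):

sum(diag(confusion_matrix)) / sum(confusion_matrix)   # accuracy
confusion_matrix[2, 2] / sum(confusion_matrix[, 2])   # sensitivity (true positive rate)
confusion_matrix[1, 1] / sum(confusion_matrix[, 1])   # specificity (true negative rate)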

The ROC curve can be obtained via the package ROCR.

library(ROCR)
pred <- prediction(predict(t0, type = "prob")[,2], titanic$survived)

# visualize the ROC curve
plot(performance(pred, "tpr", "fpr"), col = "blue", lwd = 3)
abline(0, 1, lty = 2)

plot(performance(pred, "acc"), col = "blue", lwd = 3)
abline(1, 0, lty = 2)


Figure 4.21: The ROC curve (true positive rate as a function of the false positive rate).

Figure 4.22: The accuracy for the decision tree on the Titanic data, as a function of the cutoff.


Finally we mention the AUC (area under the ROC curve), the Gini coefficient and the KS (Kolmogorov-Smirnov) statistic.

# AUC
AUC <- attr(performance(pred, "auc"), "y.values")[[1]]
AUC

## [1] 0.816288

# GINI
2 * AUC - 1

## [1] 0.632576

# KS
perf <- performance(pred, "tpr", "fpr")
max(attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]])

## [1] 0.5726823


4.6 Random Forest

Decision trees are a popular method for various machine learning tasks. Tree learning is invariant under scaling and various other transformations of feature values, is robust to the inclusion of irrelevant features, and produces models that are transparent and can be understood by humans.

However, they are seldom robust. In particular, trees that are grown very deep tend to learn highly irregular patterns: they "overfit" their training sets, i.e. they have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance (by averaging the trees). This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

Random forests use a combination of techniques to counteract over-fitting. Firstly, many samples of the data are selected; secondly, variables are put in and out of the formula.

With this process one also obtains a measure of importance of an attribute (input variable), which in its turn can be used to select a model. This can be particularly useful when forward/backward stepwise selection is not appropriate and when working with an extremely high number of candidate variables that needs to be reduced.
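To see the first ingredient (bagging) in isolation, the sketch below is a hand-made illustration and not the actual randomForest algorithm (which additionally samples candidate variables at every split): it simply averages rpart trees grown on bootstrap samples.

# sketch: bagging rpart trees by hand
library(rpart)
bag_predict <- function(formula, data, n_trees = 25) {
  preds <- replicate(n_trees, {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample
    predict(rpart(formula, data = boot), newdata = data)
  })
  rowMeans(preds)                                        # average over the trees
}
p <- bag_predict(mpg ~ cyl + disp + hp + wt, mtcars)
mean((mtcars$mpg - p)^2)   # in-sample MSE of the bagged model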

library(randomForest)

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
##     combine
## The following object is masked from 'package:ggplot2':
##
##     margin

The process of generating a random forest needs two steps that require a source of randomness, so it is a good idea to set the random seed in R before you begin. Doing so makes your results reproducible the next time you run the code. This is done by the function set.seed(n).

mtcars$l <- NULL # remove our variable
frm <- mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear
set.seed(112)
fit.rf = randomForest(frm, data=mtcars)
print(fit.rf)

## Call:
##  randomForest(formula = frm, data = mtcars)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 6.15652
##                     % Var explained: 82.5

Figure 4.23: The plot of a randomForest object shows how the model improves in function of the number of trees used.

importance(fit.rf)

##      IncNodePurity
## cyl      181.70973
## disp     248.17963
## hp       195.82634
## drat      86.28333
## wt       230.51709
## qsec      48.94831
## am        22.91795
## gear      36.19651

plot(fit.rf)


Figure 4.24: The importance of each variable in the random-forest model.

plot(importance(fit.rf), lty = 2, pch = 16)
lines(importance(fit.rf))

imp <- importance(fit.rf)
impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
op <- par(mfrow = c(1, 3))
for (i in seq_along(impvar)) {
  partialPlot(fit.rf, mtcars, impvar[i], xlab = impvar[i],
              main = paste("Partial Dependence on", impvar[i]))
}

# visualization of the RF: inspect one individual tree
getTree(fit.rf, 1, labelVar = TRUE)

The random forest method teaches us that in this case one might make a tree fitting on the number of cylinders, the displacement, the horse power and the weight, without fear that the model is overfitted to this particular data-set.⁶

⁶Of course, in reality one has to be very careful: something can happen that disturbs the dependence between the parameters. For example, once electric or hybrid cars are introduced this will look quite different.


Figure 4.25: Partial dependence on the variables disp, wt and hp.

Figure 4.26: Partial dependence on the variables cyl, drat and qsec.


Figure 4.27: Partial dependence on the variables gear and am.

frm <- mpg ~ cyl + disp + hp + wt
tr <- tree(frm, data = mtcars)
summary(tr)

## Regression tree:
## tree(formula = frm, data = mtcars)
## Variables actually used in tree construction:
## [1] "wt"  "cyl" "hp"
## Number of terminal nodes:  5
## Residual mean deviance:  4.023 = 108.6 / 27
## Distribution of residuals:
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -4.067  -1.361   0.220   0.000   1.361   3.833

plot(tr); text(tr)

This tree is shown in Figure 4.28; it is essentially the same as the rpart tree of Figure 4.15.


Figure 4.28: The stable tree model underpinned by the random-forest technique.

4.7 Artificial Neural Networks (ANN)

4.7.1 The basics of ANNs in R

Neural networks are a class of models that are inspired by our understanding of the animal brain. In the brain one has neurons that can be in two states: firing (i.e. active) and non-firing (non-active). Those neurons are connected, and the "decision" to fire or not fire is made based on the incoming signals of the surrounding neurons. If the sum of all signals exceeds a certain threshold, the neuron will be in firing mode itself in the next round; if not, then it will be inactive in the next round.

The human brain consists of a neural network (NN) and works at more or less 100 Hz. While the term neural network actually refers to the biological brain, for the purpose of this book we use ANN and NN as synonyms.

Artificial neural networks (ANNs) not only display similarities with ourbrain in the way they are designed, but also display the ability to “learn”from examples. This is called “training the neural network”.

In a first and simple approach one could consider a two-dimensional neural network of 100 by 100 pixels that can be either black (inactive) or white (firing). One could define an energy function such that it has local minima for the letters A to Z. Then the system will be able to recognize a hand-written letter that is drawn.

This is a very simple neural network with just one layer (no hidden layers, and even the output layer is the input layer itself) and the local optima of the energy function are pre-defined. Even in this simple example there are many degrees of freedom to design the neural network.⁷

The potential in each node is something like

E_{ij}(t+1) = f\left( \sum_{k=1}^{N} \sum_{l=1}^{M} w_{ij,kl} \, E_{kl}(t) \right)

where f is typically something like

f(x) = \begin{cases} 0 & \text{if } g\left( \sum_{k=1}^{N} \sum_{l=1}^{M} w_{ij,kl} E_{kl}(t) \right) < \theta \\ 1 & \text{if } g\left( \sum_{k=1}^{N} \sum_{l=1}^{M} w_{ij,kl} E_{kl}(t) \right) \geq \theta \end{cases}

The weight w_{ij,kl} is the strength of the connection between neuron ij and neuron kl (two indices, since in our example we work from a two-dimensional representation). When the network is being trained, the weights are modified till an optimal fit is achieved. In its turn there is a possible choice to be made for g(α): typically one will take something like the logit, logistic or tanh() function. For example:

g(\alpha) = \mathrm{logistic}(\alpha) = \mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + e^{-\alpha}}

or

g(\alpha) = \Phi\left( \sqrt{\pi/8} \, \alpha \right)

(the scaling of α is not necessary, but if applied the derivative in 0 will be the same as for the logistic function), or

g(\alpha) = \tanh(\alpha)
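The three candidates can be compared visually with a few lines of base R (our own sketch; the probit variant is drawn without any rescaling of α):

# sketch: compare the three activation functions
curve(1 / (1 + exp(-x)), -4, 4, col = "blue", ylim = c(-1, 1),
      xlab = expression(alpha), ylab = expression(g(alpha)))
curve(pnorm(x), -4, 4, add = TRUE, col = "forestgreen")
curve(tanh(x), -4, 4, add = TRUE, col = "red")
legend("topleft", c("logistic", "probit", "tanh"), lty = 1,
       col = c("blue", "forestgreen", "red"))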

The parameter θ is of course a number that is also "free" to be chosen. While this approach already has some applications, neural nets become more useful and better models if we allow them to learn by themselves and allow them to create internal hidden layers of neurons.

For example, in image recognition, it is possible to make a NN learn to identify images that contain gorillas by analysing example images that have been labelled as "has gorilla" or "has no gorilla". Training the NN on a set of sample images will make it capable of recognizing gorillas in new pictures. It does this without any a priori knowledge about gorillas (it is for example not necessary for the neural net to know that gorillas have four limbs, are usually black or grey, have 2 brown eyes, etc.). The NN will instead create in its hidden layers some abstract concept of a gorilla. The downside of neural networks is that it is very hard or even impossible for humans to understand what that concept is or how it can be interpreted.⁸

⁷Please note that a one-dimensional, one-layer neural net with a binary output parameter is similar to a logistic regression.

So, an ANN is best understood as a set of connected nodes (called artificial neurons, a simplified version of biological neurons in an animal brain). Each connection (a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signals can process them and then signal the other artificial neurons connected to it.

Usually the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is calculated by a non-linear function of the sum of its inputs, as explained above. Artificial neurons and connections typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input) to the last (output) layer, possibly after traversing the layers multiple times.

So, in order to make an ANN learn it is important to be able to calculatethe derivative so that weights can be adjusted.
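Stripped of all details, one forward pass through a single layer is just a weighted sum followed by the activation; in the sketch below (our own toy code, with random hypothetical weights W and bias b), g is the logistic function:

# sketch: one forward pass through a single layer
forward <- function(x, W, b, g = function(a) 1 / (1 + exp(-a))) {
  g(W %*% x + b)   # weighted sum of inputs, then the activation
}
W <- matrix(rnorm(6), nrow = 2)   # 2 neurons, 3 inputs (hypothetical weights)
b <- rnorm(2)                     # hypothetical bias terms
forward(c(0.2, 0.5, 0.9), W, b)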


Another way to see ANNs is as an extension of the logistic regression. Actually, the neurons inside the neural network have two possible states: active or inactive (hence 1 or 0). Every layer in a neural network can be considered as a logistic regression of many parameters onto an intermediate parameter in a hidden layer.

The interpretation of these neurons in internal layers is quite abstract. In fact they do not necessarily correspond to a real-world aspect (such as age, colour, race, sex, weight, etc.). This means that it is quite hard to understand how the decision process of a neural network works. One can easily observe the weights of the neurons in layer zero, but that does not mean that these are representative for the way the network makes a decision (except in the case where there is only one layer with one neuron, hence in the case where there is equivalence with a logistic regression).

⁸This is why banks, for example, are very slow to adopt neural networks in credit analysis. They will rather rely on less powerful linear regression models. Regressions are extremely transparent: it becomes easy to explain to regulators or in court why a certain customer was denied a loan, and it is always possible to demonstrate that there has been no illegal discrimination (e.g. refusing the loan because a person belongs to a certain racial minority group). With neural networks this is not possible: it might be that the neural net has in its hidden layer something like a racial bias and that the machine derives the racial background via other parameters.

Figure 4.29: A logistic regression is actually a neural network with one neuron.

There is a domain of research that tries to extract knowledge from neural networks. This is called "rule extraction", whereby the logic of the neural net is approximated by a decision tree. The process is more or less as follows:

• one will first try to understand the logic of the hidden variables with clustering analysis;

• approximate the ANN's predictions by decisions on the hidden variables;

• approximate the hidden variables by decisions on the input variables;

• collapse the resulting decision tree logically.

See for example Setiono et al. (2008), Hara and Hayashi (2012), Jacobsson (2005), Setiono (1997) or the IEEE survey in Tickle et al. (1998).


As one could expect, there are several packages in R that help to fit neural networks and train them properly. First we will look at a naive example and then address the issue of over-fitting.

Neural Networks in R

#install.packages("neuralnet")
library(neuralnet)

## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
##
##     prediction

nn1 <- neuralnet(mpg ~ wt + qsec + am + hp + disp + cyl + drat + gear + carb,
                 data = mtcars, hidden = c(3, 2), linear.output = TRUE)

An interesting parameter for the function neuralnet is the parameter "hidden". This is a vector with the number of neurons in each hidden layer; the number of hidden layers corresponds to the length of the vector provided. For example, c(10, 8, 5) implies three hidden layers (the first has 10 neurons, the second 8 and the last 5). The total configuration of the network fitted above is then 9/3/2/1: we have 9 input nodes, two hidden layers of 3 and 2 neurons, and one node for the output.
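For example, the larger configuration mentioned above would be requested as follows (shown only as an illustration; on a dataset as small as mtcars such a large network may be slow or fail to converge):

# illustration only: three hidden layers of 10, 8 and 5 neurons
nn2 <- neuralnet(mpg ~ wt + qsec + am + hp + disp + cyl + drat + gear + carb,
                 data = mtcars, hidden = c(10, 8, 5), linear.output = TRUE)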

In fact it seems that neural networks are naturally good at approximating other functions; already with one layer an ANN is able to approximate any continuous function. For discontinuous functions, or for concepts that cannot be expressed as functions, more layers may be required.

It appears that one best chooses a number of neurons that is not too small but does not exceed the number of input nodes.

Another important parameter is linear.output: set this to TRUE to estimate a continuous value (regression) and to FALSE in order to solve a classification problem.

As expected the function plot() applied on an object of the class “nn”will produce a visual that makes sense for that class.

plot(nn1, rep = "best", information = FALSE)

If the parameter rep is set to ”best”, then the repetition with the smallesterror will be plotted. If not stated all repetitions will be plotted, each in aseparate window.9

The black lines in the plot of the ANN represent the connections between each neuron of the previous and the following layer, and the number is the weight of that connection. The blue lines show the bias term added in each step. The bias can be thought of as the intercept of a linear model.

⁹While in interactive mode this might be confusing, in batch mode this will create problems. For example, when using knitr and LaTeX this will cause the plot to fail to render in the document, but not result in any error message.


Figure 4.30: A simple neural net fitted to the data-set mtcars, predicting mpg. In this example we predict the fuel consumption of a car based on some other values in the dataset mtcars.

4.7.2 An example of a work-flow to develop an ANN

To get a better illustration of the power and potential complexity of neural nets, we will now have a look at a larger data-set. A good starting point is the Boston dataset from the MASS package. While still very containable, it is a larger dataset containing house values in Boston. We will model the median value of owner-occupied homes (medv) using all the other continuous variables available.

library(MASS)
d <- Boston

Step 1: missing data

apply(d, 2, function(x) sum(is.na(x)))

##    crim      zn   indus    chas     nox      rm     age
##       0       0       0       0       0       0       0
##     dis     rad     tax ptratio   black   lstat    medv
##       0       0       0       0       0       0       0


If there were missing data, we should first address this issue, but the data is complete. The next step is to split the data in a training and a testing subset.

Step 2: split the data in test and training set

set.seed(606) # set the seed for the random generator
idx.train <- sample(1:nrow(d), round(0.75*nrow(d)))
d.train <- d[idx.train,]
d.test  <- d[-idx.train,]

Step 3: fit a challenger model

lm.fit <- glm(medv ~ ., data=d.train)
summary(lm.fit)

##
## Call:
## glm(formula = medv ~ ., data = d.train)
##
## Deviance Residuals:
##         Min           1Q       Median           3Q          Max
## -16.5658392   -2.7989595   -0.6097487    1.7993942   25.9746796
##
## Coefficients:
##                   Estimate    Std. Error  t value               Pr(>|t|)
## (Intercept)   34.290260343   6.094247752  5.62666    0.000000036575737176 ***
## crim          -0.063150382   0.053387963 -1.18286              0.23763321
## zn             0.043010601   0.016504165  2.60605              0.00953346 **
## indus          0.061490014   0.075637560  0.81296              0.41677174
## chas           3.268461913   1.057104821  3.09190              0.00214132 **
## nox          -16.358972571   4.563246062 -3.58494              0.00038284 ***
## rm             3.864363178   0.482956573  8.00147    0.000000000000016403 ***
## age            0.005697014   0.015550509  0.36636              0.71431154
## dis           -1.313437501   0.228678696 -5.74359    0.000000019506662781 ***
## rad            0.285718277   0.080751195  3.53825              0.00045463 ***
## tax           -0.012681334   0.004452210 -2.84832              0.00464325 **
## ptratio       -0.958375902   0.157614907 -6.08049    0.000000003014943650 ***
## black          0.010690565   0.003333296  3.20721              0.00145839 **
## lstat         -0.563896326   0.059333177 -9.50390 < 0.000000000000000222 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 24.15937525)
##
##     Null deviance: 33120.7447  on 379  degrees of freedom
## Residual deviance:  8842.3313  on 366  degrees of freedom
## AIC: 2304.3044
##
## Number of Fisher Scoring iterations: 2

pr.lm <- predict(lm.fit, d.test)
MSE.lm <- sum((pr.lm - d.test$medv)^2)/nrow(d.test)

The sample(x, size) function outputs a vector of the specified size of randomly selected samples from the vector x. By default the sampling is without replacement, so the variable idx.train is a vector of indices.
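A minimal sketch of this behaviour (the seed is arbitrary and purely illustrative):

set.seed(1871)
sample(1:10, 4)                  # 4 distinct indices drawn from 1..10
sample(1:10, 4, replace = TRUE)  # with replacement: repeats possible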

We now have a linear model that can be used as a challenger model.

Step 4: rescale data

Before fitting a neural network it is useful to normalize the data to the interval [0,1] or [-1,1]. It appears that this helps the optimisation procedure to converge faster and it makes the results easier to understand. There are many methods available to normalize data (z-normalization, min-max scale, logistic scale, etc.). In this example we will use the min-max method and scale the data to the interval [0,1].
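The min-max transformation is simply (x - min x) / (max x - min x); a one-column sketch (our own illustration) before applying it to the whole data-frame below:

# min-max scaling of a single vector: the result lies in [0,1]
x <- d$medv
x.sc <- (x - min(x)) / (max(x) - min(x))
range(x.sc)  # 0 and 1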

d.maxs <- apply(d, 2, max)
d.mins <- apply(d, 2, min)

d.sc <- as.data.frame(scale(d, center = d.mins,
                            scale = d.maxs - d.mins))

d.train.sc <- d.sc[idx.train,]
d.test.sc  <- d.sc[-idx.train,]



Figure 4.31: A visualisation of the ANN.

Note that scale() returns a matrix and not a data-frame, so we use the function as.data.frame() to coerce the results into a data-frame.

Step 5: train the ANN

Now we are ready to train the ANN.

library(neuralnet)

# since the shorthand notation y ~ . does not work in the
# neuralnet() function we have to replicate it:
nm  <- names(d.train.sc)
frm <- as.formula(paste("medv ~", paste(nm[!nm %in% "medv"],
                                        collapse = " + ")))

nn2 <- neuralnet(frm, data = d.train.sc, hidden = c(7,5,5),
                 linear.output = TRUE)

plot(nn2,rep="best")

Now we can predict the values for the test data-set based on this model and then calculate the MSE. Since the ANN was trained on scaled data, we need to scale it back in order to make a meaningful comparison.


Step 6: test the model on the test data

pr.nn2 <- compute(nn2, d.test.sc[,1:13]) # medv is the 14th column

pr.nn2 <- pr.nn2$net.result * (max(d$medv) - min(d$medv)) + min(d$medv)
test.r <- (d.test.sc$medv) * (max(d$medv) - min(d$medv)) + min(d$medv)

MSE.nn2 <- sum((test.r - pr.nn2)^2)/nrow(d.test.sc)
print(paste(MSE.lm, MSE.nn2))

## [1] "18.4713538272444 8.96805797887096"

Apparently the net is doing a better job than the linear model at predicting medv. However, this result will depend on

1. our choices for the number of hidden layers and the numbers of neurons in each layer,

2. the selection of the training data-set.

The following code produces a plot to visualize the performance of the two models.

par(mfrow=c(1,2))
plot(d.test$medv, pr.nn2, col='red',
     main='Observed vs predicted NN', pch=18, cex=0.7)
abline(0, 1, lwd=2)
legend('bottomright', legend='NN', pch=18, col='red', bty='n')
plot(d.test$medv, pr.lm, col='blue',
     main='Observed vs predicted lm', pch=18, cex=0.7)
abline(0, 1, lwd=2)
legend('bottomright', legend='LM', pch=18, col='blue', bty='n',
       cex=.95)

We see that indeed the predictions of the neural network are closer to the observed values than those of the linear model. However, again we stress the fact that this picture can be very different for other choices of hidden layers. Since in our data-set there are not too many observations, it is possible to plot both in one graph.

plot(d.test$medv, pr.nn2, col='red',
     main='Observed vs predicted NN', pch=18, cex=0.7)
points(d.test$medv, pr.lm, col='blue', pch=18, cex=0.7)
abline(0, 1, lwd=2)
legend('bottomright', legend=c('NN','LM'), pch=18,
       col=c('red','blue'))



Figure 4.32: A visualisation of the performance of the ANN compared to the linear regression model.


Figure 4.33: A visualisation of the performance of the ANN compared to the linear regression model with both models in one plot.


Cross Validation

A more profound way to validate a model is to rely on a "cross-validation", where we do the split between training and test data a number of times (K), each time fit the model and compare the results. The average of our chosen error measure will then be indicative for the model.

We are going to implement a cross validation using a for loop for the neural network and the cv.glm() function in the boot package for the linear model.

Below is the code for the 10-fold cross-validation MSE for the linear model:

library(boot)
set.seed(123)
lm.fit <- glm(medv ~ ., data=d)
cv.glm(d, lm.fit, K=10)$delta[1] # the estimate of prediction error

## [1] 23.56128891

Now we will fit the ANN. Note that we split the data so that we retain 90% of the data in the training set and 10% in the test set. This we will do randomly ten times.

It is possible to show a progress bar via the package plyr. To see this progress bar, uncomment the relevant lines.

Cross Validation of the ANN

set.seed(450)
cv.error <- NULL
k <- 10
# library(plyr)
# pbar <- create_progress_bar('text')
# pbar$init(k)
for(i in 1:k){
  index <- sample(1:nrow(d), round(0.9*nrow(d)))
  train.cv <- d.sc[index,]
  test.cv  <- d.sc[-index,]
  nn2 <- neuralnet(frm, data = train.cv, hidden = c(7,5,5),
                   linear.output = TRUE)
  # the explaining variables are in the first 13 columns, so:
  pr.nn2 <- compute(nn2, test.cv[,1:13])
  pr.nn2 <- pr.nn2$net.result * (max(d$medv) - min(d$medv)) +
            min(d$medv)
  test.cv.r <- (test.cv$medv) * (max(d$medv) - min(d$medv)) +
               min(d$medv)
  cv.error[i] <- sum((test.cv.r - pr.nn2)^2)/nrow(test.cv)
  # pbar$step() # uncomment to see the progress bar
}



Figure 4.34: A boxplot for the MSE cross validation for the ANN.

This can take a while. Once it is finished, we calculate the average MSE and plot the results as a boxplot:

mean(cv.error)

## [1] 16.24336205

cv.error

##  [1] 29.100866621 19.141683605  7.878882838  9.633495442
##  [5] 11.528987755 15.537789044  7.526869125 13.769259438
##  [9] 18.782707852 29.533078779

boxplot(cv.error, xlab='MSE', col='gray', border='blue',
        names='CV error (MSE)',
        main='Cross Validation error (MSE) for the ANN',
        horizontal=TRUE)

As you can see, the average MSE for the neural network is lower than the one of the linear model, although there is a lot of variation in the MSEs of the cross validation. This may depend on the splitting of the data or the random initialization of the weights in the ANN.

It is also important to realize that this result in its turn can be influenced by the chosen seed for the random generator. By running the simulation several times with different seeds it is possible to get an idea about the sensitivity of the MSE for this seed.
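A minimal sketch of such a sensitivity check (the helper cv.ann is our own wrapper around the loop above; it assumes d, d.sc and frm as defined earlier):

# wrap the cross-validation loop in a function of the seed
cv.ann <- function(seed, k = 10) {
  set.seed(seed)
  err <- numeric(k)
  for (i in 1:k) {
    index <- sample(1:nrow(d), round(0.9 * nrow(d)))
    nn <- neuralnet(frm, data = d.sc[index,],
                    hidden = c(7,5,5), linear.output = TRUE)
    pr <- compute(nn, d.sc[-index, 1:13])$net.result *
          (max(d$medv) - min(d$medv)) + min(d$medv)
    obs <- d.sc[-index, "medv"] * (max(d$medv) - min(d$medv)) +
           min(d$medv)
    err[i] <- sum((obs - pr)^2) / length(obs)
  }
  mean(err)
}
sapply(c(450, 606, 1871), cv.ann) # one average MSE per seed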


4.8 Bootstrapping

Bootstrapping is the process of working on a subset of the data and calculating parameters such as the mean and spread, and even calibrating models on that subset.

The reasons to do this are:

• the whole data-set is too large to work with

• to test the robustness of the model: we calibrate it on a subset of the data and see how it performs on another set of data.

Bootstrapping in R

The function sample() takes a sample from data.

Definition 22 .:. sample() .:.

sample(x, size, replace = FALSE, prob = NULL) with

• x: either a vector of one or more elements from which to choose, or a positive integer.

• size: the number of items to select from x

• replace: set to TRUE if sampling is to be done with replacement

• prob: a vector of probability weights for obtaining the elements of the vector being sampled

Example: Sampling the SP500 data

# create the sample
SP500.sample <- sample(SP500, size=100)

# visualize the sample
par(mfrow=c(2,2))
hist(SP500, main="Histogram of all data", fr=FALSE,
     breaks=c(-9:5), ylim=c(0,0.4))
hist(SP500.sample, main="Histogram of the sample",
     fr=FALSE, breaks=c(-9:5), ylim=c(0,0.4))
boxplot(SP500, main="Boxplot of all data", ylim=c(-9,5))
boxplot(SP500.sample, main="Boxplot of the sample",
        ylim=c(-9,5))



Figure 4.35: Bootstrapping the returns of the S&P500 index.

mean(SP500)

## [1] 0.04575267041

mean(SP500.sample)

## [1] 0.03431645051

sd(SP500)

## [1] 0.9477464375

sd(SP500.sample)

## [1] 0.7677557665
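To see the bootstrap idea at work, the following sketch (the seed and object name are ours) repeats the sampling many times and looks at the spread of the resulting means:

# approximate the sampling distribution of the mean by resampling
set.seed(1066)
boot.means <- replicate(1000,
                        mean(sample(SP500, size=100, replace=TRUE)))
quantile(boot.means, c(0.025, 0.975)) # a simple 95% bootstrap interval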


4.9 Cross-Validation

tba.


PART I .:. CHAPTER 5 .:.

Data Visualization Methods

5.1 Heat-maps

Long gone are Galileo's times, where data was scarce and had to be produced carefully. We live in a time where there is a lot of data around. Typically, commercial and scientific institutions have too much data to handle easily, or to see what is going on. Imagine for example having the data of a loan book of one million loans. One way to get started is a heatmap. A heatmap is in essence a visualization of a matrix.

Heatmap

d <- as.matrix(mtcars, scale="none")
heatmap(d)

Note that this function by default will change the order of rows and columns in order to be able to create a pattern; the heuristics are visualized by the dendrograms.
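If the original order has to be kept, the re-ordering can be switched off (a minimal sketch, using the matrix d from above):

# suppress the re-ordering (and hence the dendrograms)
heatmap(d, Rowv = NA, Colv = NA)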

Heatmap Scaling

While this is interesting, the results for the low numbers are all red. This is because the numbers are not of the same nature, so we might want to rescale.

heatmap(d,scale="column")

The function heatmap has many useful parameters:

• scale: the default is "row"; it can be turned off by using "none" or switched to "column";


Figure 5.1: Heatmap for the “mtcars” data.


Figure 5.2: Heatmap for the "mtcars" data with all columns rescaled.


• Rowv and Colv: determine if and how the row or column dendrogram should be computed and re-ordered. Use "NA" to suppress re-ordering;

• na.rm: a logical value that indicates how missing values have to be treated;

• labCol and labRow: these can be character vectors with row and column labels to use in the plot. Their default is rownames(x) or colnames(x) respectively (with x the matrix supplied to the function);

• main, xlab, ylab: have their usual meaning;

• keep.dendro: logical indicating if the dendrogram(s) should be kept (when Rowv and/or Colv are not NA);

• other parameters can be explored via the command help(heatmap); a short illustration follows this list.
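As a sketch combining a few of these parameters (the title and axis labels are our own choices):

heatmap(d, scale = "column", Colv = NA,
        main = "mtcars, columns rescaled",
        xlab = "variable", ylab = "car model")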


5.2 Text Mining

5.2.1 Word Clouds

generating word clouds in R

# if necessary first download the packages
install.packages("tm")           # text mining
install.packages("SnowballC")    # text stemming
install.packages("RColorBrewer") # color palettes
install.packages("wordcloud")    # word-cloud generator

# then load the packages
library("tm")

## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
##     annotate

library("SnowballC")library("RColorBrewer")library("wordcloud")

##
## Attaching package: 'wordcloud'
## The following object is masked from 'package:gplots':
##
##     textplot

# then load a txt-file in a variable
## The Divine Comedy
## translated by Henry Wadsworth Longfellow
## (e-text courtesy ILT's Digital Dante Project)
t <- readLines("dante.txt")

# then create a corpus of text
doc <- Corpus(VectorSource(t))

Cleaning the text

Philippe De Brouwer — 180 —

Page 181: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

5.2. TEXT MINING

# the file still has a lot of special characters
# e.g. the following replaces /, @ and | with space:
toSpace <- content_transformer(function(x, pattern)
                                 gsub(pattern, " ", x))
doc <- tm_map(doc, toSpace, "/")
doc <- tm_map(doc, toSpace, "@")
doc <- tm_map(doc, toSpace, "\\|")

Note that the command inspect(doc) would return the file content. The tm_map() function is used to remove unnecessary white space, to convert the text to lower case, and to remove common stopwords like "the" and "we". The information value of stopwords is near zero due to the fact that they are so common in a language. Removing these kinds of words is useful before further analyses. For stopwords, supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.

I'll also show you how to make your own list of stopwords to remove from the text.

You could also remove numbers and punctuation with the removeNumbers and removePunctuation arguments.

Another important preprocessing step is text stemming, which reduces words to their root form. In other words, this process removes suffixes from words to make them simple and to get the common origin. For example, a stemming process reduces the words moving, moved and movement to the root word, move.
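A minimal sketch of stemming a few words directly with SnowballC (outside the tm pipeline; the word list is our own):

library(SnowballC)
wordStem(c("moving", "moved", "moves"), language = "english")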

Note that text stemming requires the package SnowballC. The R code below can be used to clean your text:

# Convert the text to lower case
doc <- tm_map(doc, content_transformer(tolower))
# Remove numbers
doc <- tm_map(doc, removeNumbers)
# Remove english common stopwords
doc <- tm_map(doc, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
doc <- tm_map(doc, removeWords, c("one","said","upon","unto",
    "now","made","still","will","thus","come","within","see"))
# Remove punctuation
doc <- tm_map(doc, removePunctuation)
# Eliminate extra white spaces
doc <- tm_map(doc, stripWhitespace)
# Text stemming
# doc <- tm_map(doc, stemDocument)


Note, however, that the word stemming is not perfect. For example, it replaces "probability density function" with "probabl densiti function". That's why we choose to comment it out.

Build a term-document matrix

The term-document matrix is a table containing the frequency of the words: row names are words and column names are documents. The function TermDocumentMatrix() from the text mining package can be used as follows.

dtm <- TermDocumentMatrix(doc)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

##          word freq
## thou     thou  384
## thee     thee  156
## thy       thy  102
## master master   81
## saw       saw   71
## turned turned   68
## may       may   68
## great   great   64
## art       art   62
## doth     doth   60

Plot the word frequencies

The frequencies of the ten most frequent words are plotted:

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")

Generate the Word cloud

set.seed(1969)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

The important parameters of the wordcloud function are:

• words : the words to be plotted

• freq : their frequencies



Figure 5.3: The frequency of the ten most frequently occurring words in this text.


Figure 5.4: Wordcloud of Dante Alighieri's "Divine Comedy".


• min.freq : words with frequency below min.freq will not be plotted

• max.words : maximum number of words to be plotted

• random.order : plot words in random order. If false, they will be plotted in decreasing frequency

• rot.per : proportion of words with 90 degree rotation (vertical text)

• colors : color words from least to most frequent. Use, for example, colors = "black" for a single color.

5.2.2 Word Associations

You can have a look at the frequent terms in the term-document matrix as follows. In the example below we want to find words that occur at least 150 times:

Word Associations in R

findFreqTerms(dtm, lowfreq = 150)

## [1] "thou" "thee"

One can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function.

# e.g. for the word "thou"
findAssocs(dtm, terms = "thou", corlimit = 0.15)

## $thou
##     art    hast   shalt    dost   canst wouldst   seest
##    0.30    0.27    0.26    0.25    0.19    0.16    0.16


PART I .:. CHAPTER 6 .:.

Examples

6.1 Financial Analysis with QuantMod

The quantmod package for R is directly helpful for financial modeling. It is "designed to assist the quantitative trader in the development, testing, and deployment of statistically based trading models" (see https://www.quantmod.com/). It allows the user to build financial models, has a simple interface to get data, and has a suite of plots that not only look professional but also provide insight for the trader.

Its website is https://www.quantmod.com.

quantmod

if(!any(grepl("quantmod", installed.packages()))){
  install.packages("quantmod")
}

library(quantmod)

## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols("GOOG",src="yahoo") #get Google's history

## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.


##
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).

## [1] "GOOG"

getSymbols(c("GS","GOOG"),src="yahoo") # to load more than one

## [1] "GS" "GOOG"

Quantmod also allows the user to specify lookup parameters, and save them for future use.

setSymbolLookup(HSBC='yahoo', GOOG='yahoo')
setSymbolLookup(DEXJPUS='FRED')
setSymbolLookup(XPTUSD=list(name="XPT/USD", src="oanda")) # Pt

# save the settings in a file
saveSymbolLookup(file="qmdata.rda")
# new sessions call loadSymbolLookup(file="qmdata.rda")
getSymbols(c("HSBC","GOOG","DEXJPUS","XPTUSD"))

## [1] "HSBC" "GOOG" "DEXJPUS" "XPTUSD"

What type of data does quantmod provide? The function stockSymbols() can provide a list of symbols that are quoted on AMEX, Nasdaq and NYSE.

stockList <- stockSymbols()

## Fetching AMEX symbols...
## Fetching NASDAQ symbols...
## Fetching NYSE symbols...

nrow(stockList) # number of symbols

## [1] 6918

head(stockList[,1]) # the symbols are in the first column

## [1] "AAMC" "AAU" "ACU" "ACY" "AE" "AEF"

If one would like to download all symbols in one data-frame, then one can use the package BatchGetSymbols.

For other services it is best to refer to the website of the data provider, e.g. https://fred.stlouisfed.org/. Alternatively one can use a service such as https://www.quandl.com/data/FRED-Federal-Reserve-Economic-Data to locate data from a variety of sources. For FX (foreign exchange) data, www.oanda.com is a good starting point and the function getFX() will do most of the work.

getFX("EUR/PLN",from="2005-01-01")

## Warning in getSymbols.oanda(Symbols = Currencies, from = from, to = to, :
## Oanda only provides historical data for the past 180 days.
## Symbol: EUR/PLN

## [1] "EURPLN"

Metals can be found with the function getMetals(). More information about all possibilities can be found in the documentation of quantmod: https://cran.r-project.org/web/packages/quantmod/quantmod.pdf

Plotting with quantmod

getSymbols("HSBC",src="yahoo") #get HSBC's data from Yahoo

## [1] "HSBC"

This created a data object with the name "HSBC" and it can be used directly in the traditional functions and arithmetic, but quantmod also provides some specific functions that make working with financial data easier.

barChart(HSBC)

# note: while plot(HSBC) will result in a line chart of
# the daily returns, there is also lineChart():
lineChart(HSBC)

# candleChart(HSBC, subset='2018::2018-01')
# candleChart(HSBC, subset='last 1 months') # same as previous
candleChart(HSBC, subset='last 1 years', theme="white",
            multi.col=TRUE)

So there is no need to find a data source, download a portable format (such as CSV), load it in a data-frame, name the columns, etc. The getSymbols() function downloads the daily data going ten years back in the past (if available of course).



Figure 6.1: Demonstration of the barChart() function of the package quantmod.


Figure 6.2: Demonstration of the lineChart() function of the package quantmod.



Figure 6.3: Demonstration of the candleChart() function of the package quantmod.

The plot functions such as lineChart(), barChart() and candleChart() display the data in a professional and clean fashion. The looks can be customized with a theme-based parameter that uses intuitive conventions.

There is much more, and we encourage you to explore by yourself. To give you a taste, let's display a stock chart with indicators with three short lines of code:

getSymbols(c("HSBC"))

## [1] "HSBC"

chartSeries(HSBC, subset='last 4 months')
addBBands(n = 20, sd = 2, maType = "SMA", draw = 'bands',
          on = -1)

where n is the sampling size of the moving average (of type "maType"; SMA is simple moving average) and sd is the multiplier for the standard deviation to be used as band.

Bollinger bands aim to introduce a relative indication of how high a stock is trading relative to its past. Therefore one calculates a moving average and volatility over a given period (typically n = 20) and plots bands with plus and minus the volatility multiplied with a multiplier (typically sd = 2). If the stock trades below the lower band then it is to be considered as low.



Figure 6.4: Bollinger bands with the package quantmod.

The lines drawn correspond to ma − sd·σ, ma and ma + sd·σ, with σ the standard deviation and where ma and sd take the values as defined above. Some traders will then buy it because it is considered as "a buy opportunity"; others might consider this as a break in trend and rather short it (or sell off what they had as it is not living up to its expectations).
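As a sketch, the same bands can be computed by hand with the SMA() and runSD() functions from TTR (which is loaded together with quantmod); the object names are our own:

cl <- Cl(HSBC)            # closing prices
ma <- SMA(cl, n = 20)     # 20-day simple moving average
s  <- runSD(cl, n = 20)   # 20-day running standard deviation
upper <- ma + 2 * s       # upper Bollinger band
lower <- ma - 2 * s       # lower Bollinger band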

6.1.1 The quantmod data structure

Using quantmod you will find that it is possible to accomplish complex tasks with a very limited number of lines of code. This is made possible thanks to the use of the xts package. This means that quantmod uses under the hood the extended time series package (short xts). Installing quantmod will also install xts.

The use of xts offers many advantages. For example, it makes it more intuitive to manipulate data as a function of the time-stamp.

myxtsdata["2008-01-01/2010-12-31"] # between 2 date-stamps

# all data before or after a certain time-stamp:xtsdata["/2007"] # from start of data until end of 2007

Philippe De Brouwer — 190 —

Page 191: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

6.1. FINANCIAL ANALYSIS WITH QUANTMOD

xtsdata["2009/"] # from 2009 until the end of the data

# select the data between different hours:xtsdata["T07:15/T09:45"]

Subsetting by Time and Date

The xts package aims to offer tools that make it easy to work with time-based series. As such it extends the zoo class, and adds some new methods for sub-setting data.1

HSBC['2017']    # returns HSBC's OHLC data for 2017
HSBC['2017-08'] # returns HSBC's OHLC data for August 2017
HSBC['2017-06::2018-01-15'] # from June 2017 to Jan 15 2018

HSBC['::']     # returns all data
HSBC['2017::'] # returns all data in HSBC, from 2017 onward
my.selection <- c('2017-01','2017-03','2017-11')
HSBC[my.selection]

The date format follows the ANSI standard (CCYY-MM-DD HH:MM:SS); the ranges are specified via the "::" operator. The main advantage is that these functions are robust regardless of the underlying data: this data can be a quote every minute, every day or month, and the functions will still work.

In order to re-run a particular model one will typically need the data of the last few weeks, months or years. Also here xts comes to the rescue with handy shortcuts:

last(HSBC)             # returns the last quotes
last(HSBC, 5)          # returns the last 5 quotes
last(HSBC, '6 weeks')  # the last 6 weeks
last(HSBC, '-1 weeks') # all but the last week
last(HSBC, '6 months') # the last 6 months
last(HSBC, '3 years')  # the last 3 years

# these functions can also be combined:
last(first(HSBC, '3 weeks'), '5 days')

Aggregating to a different time scale

One of the interesting aspects in trading is that even while trading on a short frequency (such as milliseconds) one will not want to miss bigger trends, and the other way around. So it is essential to be able to aggregate data into lower frequency objects, for example to convert daily data to monthly data.

1Type help('[.xts') to get more information about data extraction of time series.


xts provides the tools to do these conversions in a robust and orderly manner with the functions to.weekly() and to.monthly() for example, or the more specific functions such as to.minutes5() and to.minutes10() that will convert to 5 and 10 minute data respectively.

periodicity(HSBC)
unclass(periodicity(HSBC))
to.weekly(HSBC)
to.monthly(HSBC)
periodicity(to.monthly(HSBC))
ndays(HSBC); nweeks(HSBC); nyears(HSBC)

As these functions are dependent on the upstream xts and not specific to quantmod, they can be used also on non-OHLC data:

getFX("USD/EUR")

## [1] "USDEUR"

periodicity(USDEUR)

## Daily periodicity from 2018-04-05 to 2018-09-30

to.monthly(USDEUR)

##          USDEUR.Open USDEUR.High USDEUR.Low USDEUR.Close
## Apr 2018    0.815842    0.826440   0.807985     0.826440
## May 2018    0.831248    0.864156   0.831248     0.856106
## Jun 2018    0.856469    0.864520   0.846852     0.855900
## Jul 2018    0.855998    0.859963   0.850368     0.854027
## Aug 2018    0.856412    0.882386   0.855056     0.859050
## Sep 2018    0.861891    0.865538   0.850048     0.861722

periodicity(to.monthly(USDEUR))

## Monthly periodicity from Apr 2018 to Sep 2018

Apply by Period

Often it will be necessary to identify endpoints in your data by date with the function endpoints(). Those endpoints can be used with the functions in the period.apply family. This allows one to calculate periodic minimums, maximums, sums, and products, as well as more general user-defined functions.

endpoints(HSBC,on="years")

##  [1]    0  251  504  756 1008 1260 1510 1762 2014 2266
## [11] 2518 2769 2957


# find the maximum closing price each year
apply.yearly(HSBC, FUN=function(x) { max(Cl(x)) })

##                 [,1]
## 2007-12-31 99.519997
## 2008-12-31 87.669998
## 2009-12-31 63.950001
## 2010-12-31 59.320000
## 2011-12-30 58.990002
## 2012-12-31 53.070000
## 2013-12-31 58.610001
## 2014-12-31 55.959999
## 2015-12-31 50.169998
## 2016-12-30 42.959999
## 2017-12-29 51.660000
## 2018-09-28 55.619999

# the same thing - only more general
subHSBC <- HSBC['2012::']
period.apply(subHSBC, endpoints(subHSBC, on='years'),
             FUN=function(x) { max(Cl(x)) })

##                 [,1]
## 2012-12-31 53.070000
## 2013-12-31 58.610001
## 2014-12-31 55.959999
## 2015-12-31 50.169998
## 2016-12-30 42.959999
## 2017-12-29 51.660000
## 2018-09-28 55.619999

# the following line does the same but is faster:
as.numeric(period.max(Cl(subHSBC),
                      endpoints(subHSBC, on='years')))

## [1] 53.070000 58.610001 55.959999 50.169998 42.959999
## [6] 51.660000 55.619999

6.1.2 Support functions supplied by quantmod

quantmod has some useful features. For example, quantmod dynamically creates data objects, creating a model frame internally after going through some steps to identify the sources of the data required (and loading them if required).

— 193 — Philippe De Brouwer

Page 194: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

CHAPTER 6. EXAMPLES

There are some basic types of data for all financial assets: Op, Hi, Lo, Cl, Vo, Ad are respectively the Open, High, Low, Close, Volume, and Adjusted2 of a given instrument and are stored in the columns of the data object.

There are some functions to test if an object is of the right format and if the instrument has certain data: is.OHLC(), has.OHLC(), has.Op(), has.Cl(), has.Hi(), has.Lo(), has.Ad(), and has.Vo(). Others extract certain data, such as Op(), Cl(), Hi(), Lo(), Ad(), and Vo(). Still other functions, such as seriesHi() and seriesLo(), return respectively the high and the low of a given data-set.

seriesHi(HSBC)

##            HSBC.Open HSBC.High  HSBC.Low HSBC.Close
## 2007-10-31 98.919998 99.519997 98.050003  99.519997
##            HSBC.Volume HSBC.Adjusted
## 2007-10-31     1457900     56.919506

has.Cl(HSBC)

## [1] TRUE

tail(Cl(HSBC))

## HSBC.Close## 2018-09-21 44.919998## 2018-09-24 44.650002## 2018-09-25 44.759998## 2018-09-26 44.900002## 2018-09-27 44.880001## 2018-09-28 43.990002

There are even functions that will calculate differences, for example:

• OpCl(): daily percent change open to close

• OpOp(): daily open to open change

• HiCl(): the percent change from high to close

These functions rely on the following functions, which are also available to use:

• Lag(): gets the previous value in the series

• Next(): gets the next value in the series

• Delt(): returns the change (delta) from two prices

2The adjusted closing price is the closing price where any eventual dividend is added if it was announced after market closure.


Lag(Cl(HSBC))
Lag(Cl(HSBC), c(1,5,10)) # one, five and ten period lags
Next(OpCl(HSBC))

# Open to close one, two and three-day lags:
Delt(Op(HSBC), Cl(HSBC), k=1:3)

There are many more wrappers and functions, such as period.min(), period.sum(), period.prod(), and period.max().

Period Returns

More often than not it is the return that we are interested in, and there is a set of functions that makes it easy and straightforward to compute it. There is of course the master function periodReturn(), which takes a parameter period to indicate what periods are desired. Then there is also a suite of derived functions that carry the name of the relevant period.

The convention used is that the first observation of the period is the first trading time of that period, and the last observation is the last trading time of the period, on the last day of the period. xts has adopted the last observation of a given period as the date to record for the larger period. This can be changed via the indexAt argument; we refer to the documentation for more details.

dailyReturn(HSBC)
weeklyReturn(HSBC)
monthlyReturn(HSBC)
quarterlyReturn(HSBC)
yearlyReturn(HSBC)
allReturns(HSBC) # all previous returns

6.1.3 Financial modeling in quantmod

To specify financial models, there is the function specifyModel(). It is recommended to read its help file. Typically one can specify data within the call to specifyModel(), and quantmod will look up the relevant data and take care of the data aggregation.

Consider the following naive model:

# First we create a quantmod object.
# At this point we do not need to load data.
setSymbolLookup(SPY='yahoo',
                VXN=list(name='^VIX', src='yahoo'))

qmModel <- specifyModel(Next(OpCl(SPY)) ~ OpCl(SPY) + Cl(VIX))

## Warning: VIX contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider
## using na.omit(), na.approx(), na.fill(), etc to remove or replace them.

head(modelData(qmModel))

##               Next.OpCl.SPY         OpCl.SPY      Cl.VIX
## 2014-12-04  0.0006254149378  0.0005782548138 28447.69922
## 2014-12-05 -0.0043851338785  0.0006254149378 26056.50000
## 2014-12-08  0.0102755103556 -0.0043851338785 23582.80078
## 2014-12-09 -0.0133553491651  0.0102755103556 21274.00000
## 2014-12-10  0.0015204875044 -0.0133553491651 19295.00000
## 2014-12-11 -0.0086360047801  0.0015204875044 17728.30078

qmModel is now a quantmod object holding the model formula and data structure, implying that the next (Next) period's open to close of the S&P 500 ETF (OpCl(SPY)) is modelled as a function of the current period open to close and the current close of the VIX (Cl(VIX)).3

The call to modelData() extracts the relevant data set. A more direct function to accomplish the same end is buildData().

A Simple Model with quantmod

In this section we will propose a simple (and naive) model to predict the opening price of a stock based on its performance of the day before. First we import the data and decide what to model.

getSymbols('HSBC', src='yahoo') # google doesn't carry the adjusted price

## [1] "HSBC"

lineChart(HSBC)

The line-chart shows that the behaviour of the stock is very different in the period after the crisis. Therefore we decide to consider only data after 2010.

HSBC.tmp <- HSBC["2010/"] #see: subsetting for xts objects

The next step is to divide our data in a training data-set and a test-data-set. The training set is the set that we will use to calibrate the model, and

3The VIX, or CBOE Volatility Index, known by its ticker symbol VIX, is a popular measure of the stock market's expectation of volatility implied by S&P 500 index options, calculated and published by the Chicago Board Options Exchange.



Figure 6.5: The evolution of the HSBC share for the last ten years.

then we will see how it performs on the test-data. This process will give us a good idea about the robustness of the model.

# use 70% of the data to train the model:
n <- floor(nrow(HSBC.tmp) * 0.7)
HSBC.train <- HSBC.tmp[1:n]                  # training data
HSBC.test  <- HSBC.tmp[(n+1):nrow(HSBC.tmp)] # test-data
# head(HSBC.train)

Till now we used the functionality of quantmod to pull in data, but the function specifyModel() allows us to prepare the data automatically for modelling: it will align the next opening price with the explaining variables. Further, modelData() allows us to make sure the data is up-to-date.

# making sure that whenever we re-run this the latest data
# is pulled in:
m.qm.tr <- specifyModel(Next(Op(HSBC.train)) ~ Ad(HSBC.train)
             + Hi(HSBC.train) - Lo(HSBC.train) + Vo(HSBC.train))

D <- modelData(m.qm.tr)

We decide to create an additional variable that is the difference between the high and low prices of the previous day.


D$diff.HSBC <- D$Hi.HSBC.train - D$Lo.HSBC.train # add column
tail(D, n=3L) # the last value is NA

##            Next.Op.HSBC.train Ad.HSBC.train Hi.HSBC.train
## 2016-02-11          31.450001     25.784353     31.059999
## 2016-02-12          32.040001     26.800282     31.959999
## 2016-02-16                 NA     27.077351     32.369999
##            Lo.HSBC.train Vo.HSBC.train diff.HSBC
## 2016-02-11     30.389999       6385200  0.670000
## 2016-02-12     31.190001       3999000  0.769998
## 2016-02-16     31.980000       3546400  0.389999

D <- D[-nrow(D),] # so remove it

The column names of the data inherit the full name of the data-set. This is not practical, since the names will be different in the training set and in the test-set. So we rename them before making the model.

colnames(D) <- c("Next.Op","Ad","Hi","Lo","Vo","Diff")

Now we can create the model.

m1 <- lm(D$Next.Op ~ D$Ad + D$Diff + D$Vo)
summary(m1)

##
## Call:
## lm(formula = D$Next.Op ~ D$Ad + D$Diff + D$Vo)
##
## Residuals:
##         Min          1Q      Median          3Q         Max
## -13.7675581  -2.0172372   0.0813842   2.1346769   8.5005190
##
## Coefficients:
##                      Estimate        Std. Error  t value               Pr(>|t|)
## (Intercept)  6.89132644171267  0.92014596688203  7.48938    0.00000000000011617 ***
## D$Ad         1.11486808781086  0.02271114048826 49.08904 < 0.000000000000000222 ***
## D$Diff       4.34264248008493  0.34654677027649 12.53119 < 0.000000000000000222 ***
## D$Vo        -0.00000015118018  0.00000009093197 -1.66256               0.096604 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.392296 on 1535 degrees of freedom
## Multiple R-squared: 0.6501367, Adjusted R-squared: 0.6494529
## F-statistic: 950.8094 on 3 and 1535 DF, p-value: < 0.00000000000000022204

The volume of trading in the stock does not seem to play a significant role, so we leave it out.

m2 <- lm(D$Next.Op ~ D$Ad + D$Diff)
summary(m2)

##
## Call:
## lm(formula = D$Next.Op ~ D$Ad + D$Diff)
##
## Residuals:
##         Min          1Q      Median          3Q         Max
## -13.7143118  -2.0502336   0.1499596   2.1269076   8.6317173
##
## Coefficients:
##               Estimate Std. Error t value               Pr(>|t|)
## (Intercept) 6.26300690 0.83943543  7.46098    0.00000000000014304 ***
## D$Ad        1.12810356 0.02128238 53.00645 < 0.000000000000000222 ***
## D$Diff      4.05971451 0.30205885 13.44014 < 0.000000000000000222 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.394243 on 1536 degrees of freedom
## Multiple R-squared: 0.6495067, Adjusted R-squared: 0.6490503
## F-statistic: 1423.197 on 2 and 1536 DF, p-value: < 0.00000000000000022204

The output of the command summary(m2) shows us that the variables are all significant now. The R2 is ever so slightly down, but in return one has a much more stable model that is not over-fitted.

Some more tests can be done. We should also make a qq-plot to make sure the residuals are normally distributed.



Figure 6.6: The qq-plot of our naive model to forecast the next opening price of the HSBC stock. The results seem to be reasonable.

qqnorm(m2$residuals)
qqline(m2$residuals, col='blue', lwd=2)

Figure 6.6 shows that the model does capture well the tail-behaviour of the forecasted variable. However, the predicting power is not great.

Testing the model robustness

To check the robustness of our model, we should now check how well it fits the test-data. The idea is that, since the model was built only on the training data, we can assess its robustness by checking how well it does on the test-data.

First we prepare the test data in the same way as the training data:

m.qm.tst <- specifyModel(Next(Op(HSBC.test)) ~ Ad(HSBC.test)
              + Hi(HSBC.test) - Lo(HSBC.test) + Vo(HSBC.test))
D.tst <- modelData(m.qm.tst)
D.tst$diff.HSBC.test <- D.tst$Hi.HSBC.test - D.tst$Lo.HSBC.test
# tail(D.tst) # the last value is NA
D.tst <- D.tst[-nrow(D.tst),] # remove the last value that is NA

colnames(D.tst) <- c("Next.Op","Ad","Hi","Lo","Vo","Diff")

For the ease of reference we will name the coefficients.

Philippe De Brouwer — 200 —

Page 201: A Practical Introduction to Quantitative Methods and R · A Practical Introduction to Quantitative Methods and R Get Started with R and Hands on with Statistical Learning and Quantitative

6.1. FINANCIAL ANALYSIS WITH QUANTMOD

a   <- coef(m2)['(Intercept)']
bAd <- coef(m2)['D$Ad']
bD  <- coef(m2)['D$Diff']
est <- a + bAd * D.tst$Ad + bD * D.tst$Diff

Now we can calculate all possible measures of model power.

# -- Mean squared prediction error (MSPE, here its square root)
# sqrt(mean(((predict(m2, newdata = D.tst) - D.tst$Next.Op)^2)))
sqrt(mean(((est - D.tst$Next.Op)^2)))

## [1] 3.391951327

# -- Mean absolute error (MAE)
mean((abs(est - D.tst$Next.Op)))

## [1] 2.673257503

# -- Mean absolute percentage error (MAPE)
mean((abs(est - D.tst$Next.Op))/D.tst$Next.Op)

## [1] 0.05690119751

# -- squared sum of residuals
print(sum(residuals(m2)^2))

## [1] 17696.08207

# -- confidence intervals for the model
print(confint(m2))

##                   2.5 %      97.5 %
## (Intercept) 4.616446227 7.909567572
## D$Ad        1.086357955 1.169849160
## D$Diff      3.467223157 4.652205859

These values give us an estimate of what error can be expected by using this simple model.

# -- compare the coefficients in a refit
m3 <- lm(D.tst$Next.Op ~ D.tst$Ad + D.tst$Diff)
summary(m3)

##
## Call:
## lm(formula = D.tst$Next.Op ~ D.tst$Ad + D.tst$Diff)
##
## Residuals:
##         Min          1Q      Median          3Q         Max
## -13.7228832  -2.0498510   0.1472862   2.1291195   8.6319089
##
## Coefficients:
##               Estimate Std. Error t value               Pr(>|t|)
## (Intercept) 6.25463504 0.84023425  7.44392    0.00000000000016211 ***
## D.tst$Ad    1.12828539 0.02129895 52.97375 < 0.000000000000000222 ***
## D.tst$Diff  4.06197168 0.30226009 13.43866 < 0.000000000000000222 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.395264 on 1535 degrees of freedom
## Multiple R-squared: 0.6493976, Adjusted R-squared: 0.6489408
## F-statistic: 1421.589 on 2 and 1535 DF, p-value: < 0.00000000000000022204

One will notice that the estimates for the coefficients are close to the values found in model m2.

Finally, one could compare the models fitted on the training data and on the test data and consider on what time horizon the model should be calibrated before use. One can consider the whole data set, the last five years, the training data set, etc. The choice will depend on the reality of the environment rather than on naive mathematics, though one machine-learning approach would consist of trying all plausible data horizons and selecting the optimal one.
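
A minimal sketch of such a horizon search follows. It assumes the data frames D (training) and D.tst (test) with the columns Next.Op, Ad and Diff as constructed above; the grid of window lengths is an arbitrary illustration, not a recommendation.

# Refit the model on trailing windows of the training data and score
# each candidate window on the test data.
rmse <- function(est, obs) sqrt(mean((est - obs)^2))
windows <- c(250, 500, 1000, nrow(D))   # arbitrary grid of trading days
err <- sapply(windows, function(w) {
  m.w <- lm(Next.Op ~ Ad + Diff, data = tail(D, w))
  rmse(predict(m.w, newdata = D.tst), D.tst$Next.Op)
})
setNames(err, windows)   # the window with the lowest RMSE wins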


6.2 Predicting Awards

An example from UCLA, after https://stats.idre.ucla.edu/r/dae/poisson-regression/.

myURL <- "https://stats.idre.ucla.edu/stat/data/poisson_sim.csv"
p <- read.csv(myURL)
p <- within(p, {
  prog <- factor(prog, levels = 1:3,
                 labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})

summary(p)

##        id       num_awards           prog
##  1      :  1   Min.   :0.00   General   : 45
##  2      :  1   1st Qu.:0.00   Academic  :105
##  3      :  1   Median :0.00   Vocational: 50
##  4      :  1   Mean   :0.63
##  5      :  1   3rd Qu.:1.00
##  6      :  1   Max.   :6.00
##  (Other):194
##       math
##  Min.   :33.000
##  1st Qu.:45.000
##  Median :52.000
##  Mean   :52.645
##  3rd Qu.:59.000
##  Max.   :75.000
##

Each variable has 200 valid observations and their distributions seem quite reasonable. The unconditional mean and variance of our outcome variable are not extremely different. Our model assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).
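
A quick, direct check of this assumption (a small sketch, using the data frame p loaded above):

# Unconditional mean and variance of the outcome; for a Poisson model
# these should be of comparable magnitude.
with(p, c(mean = mean(num_awards), variance = var(num_awards)))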

Checking the data

We can use the tapply() function to display the summary statistics by program type. The table below shows the average numbers of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by prog. Additionally, the means and variances within each level of prog (the conditional means and variances) are similar. A conditional histogram separated out by program type is plotted to show the distribution.


[Histogram of num_awards (counts on the vertical axis), grouped by prog: General, Academic, Vocational.]

Figure 6.7: The histograms for each program.

# Load the necessary packages (install them first if needed)
require(ggplot2)
require(sandwich)

## Loading required package: sandwich

require(msm)   # also requires 'Matrix', 'survival' and 'expm'

## Loading required package: msm
##
## Attaching package: 'msm'
##
## The following object is masked from 'package:boot':
##
##     cav

# Show the mean and standard deviation for each program
with(p, tapply(num_awards, prog, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))

##                General               Academic
## "M (SD) = 0.20 (0.40)" "M (SD) = 1.00 (1.28)"
##             Vocational
## "M (SD) = 0.24 (0.52)"

# Show the data
ggplot(p, aes(num_awards, fill = prog)) +
  geom_histogram(binwidth = .5, position = "dodge")


Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.

• Poisson regression: Poisson regression is often used for modeling count data. It has a number of extensions useful for count models.

• Negative binomial regression: Negative binomial regression can be used for over-dispersed count data, that is, when the conditional variance exceeds the conditional mean. It can be considered a generalization of Poisson regression, since it has the same mean structure as Poisson regression plus an extra parameter to model the over-dispersion. If the conditional distribution of the outcome variable is over-dispersed, the confidence intervals for negative binomial regression are likely to be narrower than those from a Poisson regression (see the sketch after this list).

• Zero-inflated regression model: Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to exist in the data: true zeros and excess zeros. Zero-inflated models estimate two equations simultaneously, one for the count model and one for the excess zeros.

• OLS regression: Count outcome variables are sometimes log-transformed and analyzed using OLS regression. Many issues arise with this approach, including loss of data due to the log of zero being undefined, and biased estimates.
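
As announced in the list above, a minimal sketch of such an over-dispersion check; it assumes the MASS package (whose glm.nb() fits a negative binomial model) is installed, and uses the data frame p loaded above:

# Fit the same mean structure with a negative binomial model; theta is
# the dispersion parameter -- a very large theta means the fit is
# practically indistinguishable from the Poisson model.
library(MASS)
m.nb <- glm.nb(num_awards ~ prog + math, data = p)
summary(m.nb)$theta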

At this point, we are ready to perform our Poisson model analysis using the glm function. We fit the model, store it in the object m1, and get a summary of the model at the same time.

Fitting the model

# Poisson regression:
m1 <- glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)

##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
##        Min          1Q      Median          3Q
## -2.2043381  -0.8436418  -0.5105865   0.2557721


##        Max
##  2.6795766
##
## Coefficients:
##                  Estimate Std. Error  z value
## (Intercept)    -5.2471244  0.6584531 -7.96887
## progAcademic    1.0838592  0.3582530  3.02540
## progVocational  0.3698092  0.4410703  0.83844
## math            0.0701524  0.0105992  6.61865
##                               Pr(>|z|)
## (Intercept)      0.0000000000000016014 ***
## progAcademic                  0.002483 **
## progVocational                0.401786
## math             0.0000000000362500995 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 287.67223  on 199  degrees of freedom
## Residual deviance: 189.44962  on 196  degrees of freedom
## AIC: 373.5045
##
## Number of Fisher Scoring iterations: 6

Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distributional assumption that the variance equals the mean. We use the R package sandwich below to obtain the robust standard errors and calculate the p-values accordingly. Together with the p-values, we also calculate the 95% confidence intervals using the parameter estimates and their robust standard errors.

Covariance

cov.m1 <- vcovHC(m1, type = "HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate = coef(m1),
               "Robust SE" = std.err,
               "Pr(>|z|)" = 2 * pnorm(abs(coef(m1) / std.err),
                                      lower.tail = FALSE),
               LL = coef(m1) - 1.96 * std.err,
               UL = coef(m1) + 1.96 * std.err)
r.est

##                      Estimate    Robust SE
## (Intercept)    -5.24712439854 0.6459983880


## progAcademic    1.08385914562 0.3210481576
## progVocational  0.36980922984 0.4004173108
## math            0.07015239749 0.0104351647
##                                   Pr(>|z|)             LL
## (Intercept)    0.0000000000000004566629828 -6.51328123901
## progAcademic   0.0007354744824167541803611  0.45460475679
## progVocational 0.3557156842088270432000741 -0.41500869924
## math           0.0000000000178397516955338  0.04969947468
##                            UL
## (Intercept)    -3.98096755807
## progAcademic    1.71311353445
## progVocational  1.15462715892
## math            0.09060532031

Now let's look at the output of the function glm more closely.

• The output begins with echoing the function call. The information on deviance residuals is displayed next. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, they show a little bit of skewness, since the median is not quite zero.

• Next come the Poisson regression coefficients for each of the variables, along with the standard errors, z-scores, p-values and 95% confidence intervals for the coefficients. The coefficient for math is 0.07. This means that the expected log count increases by 0.07 for a one-unit increase in math. The indicator variable progAcademic compares prog = Academic with prog = General: the expected log count for prog = Academic is higher by about 1.1. The indicator variable progVocational is the expected difference in log count (approximately 0.37) between prog = Vocational and the reference group (prog = General).

• The information on deviance is also provided. We can use the residual deviance to perform a goodness-of-fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed ones. Therefore, if the residual difference is small enough, the goodness-of-fit test will not be significant, indicating that the model fits the data. We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well. In that situation, we may try to determine whether there are omitted predictor variables, whether our linearity assumption holds and/or whether there is an issue of over-dispersion.


with(m1, cbind(res.deviance = deviance, df = df.residual,
               p = pchisq(deviance, df.residual, lower.tail = FALSE)))

##      res.deviance  df            p
## [1,]  189.4496199 196 0.6182274457

We can also test the overall effect of prog by comparing the deviance of the full model with the deviance of the model excluding prog. The two-degree-of-freedom chi-square test indicates that prog, taken together, is a statistically significant predictor of num_awards.

## update the m1 model, dropping prog
m2 <- update(m1, . ~ . - prog)
## test model differences with a chi-square test
anova(m2, m1, test = "Chisq")

## Analysis of Deviance Table
##
## Model 1: num_awards ~ math
## Model 2: num_awards ~ prog + math
##   Resid. Df Resid. Dev Df  Deviance   Pr(>Chi)
## 1       198  204.02130
## 2       196  189.44962  2 14.571682 0.00068517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sometimes, we might want to present the regression results as incident rate ratios with their standard errors and confidence intervals. To compute the standard errors of the incident rate ratios, we will use the delta method. To this end, we make use of the function deltamethod implemented in the R package msm.

s <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4)),
                 coef(m1), cov.m1)
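
With these delta-method standard errors one can put the whole table on the incident-rate-ratio scale; a short sketch, reusing the objects r.est and s computed above:

# Exponentiate the estimates and confidence bounds to obtain incident
# rate ratios, and substitute the delta-method standard errors.
rexp.est <- exp(r.est[, -3])   # drop the p-value column first
rexp.est[, "Robust SE"] <- s
rexp.est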

The output above indicates that the incident rate for prog = Academic is 2.96 times the incident rate for the reference group (prog = General). Likewise, the incident rate for prog = Vocational is 1.45 times the incident rate for the reference group, holding the other variables constant. The incident rate of num_awards increases by 7% for every unit increase in math. For additional information on the various metrics in which the results can be presented, and on their interpretation, please see Regression Models for Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long and Jeremy Freese (2006).

Sometimes, we might want to look at the expected marginal means. For example, what are the expected counts for each program type, holding math at its overall mean? To answer this question, we can make use of the predict function. First off, we will make a small data set to apply the predict function to.

(s1 <- data.frame(math = mean(p$math),
                  prog = factor(1:3, levels = 1:3,
                                labels = levels(p$prog))))

##     math       prog
## 1 52.645    General
## 2 52.645   Academic
## 3 52.645 Vocational

predict(m1, s1, type="response", se.fit=TRUE)

## $fit
##            1            2            3
## 0.2114109451 0.6249445914 0.3060085603
##
## $se.fit
##             1             2             3
## 0.07050108135 0.08628117282 0.08833706337
##
## $residual.scale
## [1] 1

In the output above, we see that the predicted number of events for level 1 of prog is about 0.21, holding math at its mean. The predicted number of events for level 2 of prog is higher at 0.62, and the predicted number of events for level 3 of prog is about 0.31. The ratios of these predicted counts (0.625/0.211 = 2.96 and 0.306/0.211 = 1.45) match what we saw looking at the IRR.

We can also graph the predicted number of events with the commands below. The graph indicates that the most awards are predicted for those in the academic program (prog = 2), especially for students with a high math score. The lowest number of predicted awards is for those students in the general program (prog = 1). The graph overlays the lines of expected values onto the actual points, although a small amount of random noise was added vertically to lessen overplotting.

## calculate and store predicted values
p$phat <- predict(m1, type = "response")

## order by program and then by math
p <- p[with(p, order(prog, math)), ]

## create the plot
ggplot(p, aes(x = math, y = phat, colour = prog)) +
  geom_point(aes(y = num_awards), alpha = .5,
             position = position_jitter(h = .2)) +
  geom_line(size = 1) +
  labs(x = "Math Score", y = "Expected number of awards")

[Plot of the expected number of awards against the math score, one line per program (General, Academic, Vocational), with the observed awards overlaid as jittered points.]

Figure 6.8: An overview of the data and model.

Things to consider

• When there seems to be an issue of dispersion, we should first check whether our model is appropriately specified, e.g. whether variables are omitted or functional forms are wrong. For example, if we omitted the predictor variable prog in the example above, our model would seem to have a problem with over-dispersion. In other words, a misspecified model can present a symptom like an over-dispersion problem.

• Assuming that the model is correctly specified, the assumption that the conditional variance is equal to the conditional mean should be checked. There are several tests, including the likelihood ratio test of the over-dispersion parameter alpha, obtained by running the same model using the negative binomial distribution. The R package pscl (Political Science Computational Laboratory, Stanford University) provides many functions for binomial and count data, including odTest for testing over-dispersion.

• One common cause of over-dispersion is excess zeros, which in turn are generated by an additional data-generating process. In this situation, a zero-inflated model should be considered.


• If the data generating process does not allow for any 0s (such as the number of days spent in the hospital), then a zero-truncated model may be more appropriate.

• Count data often have an exposure variable, which indicates the number of times the event could have happened. This variable should be incorporated into a Poisson model with the use of the offset option.

• The outcome variable in a Poisson regression cannot have negative numbers, and the exposure cannot have 0s.

• Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression, even though none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds, see Long and Freese (2006) or the UCLA FAQ page "What are pseudo R-squareds?"; a simple deviance-based variant is sketched after this list.

• Poisson regression is estimated via maximum likelihood estimation. It usually requires a large sample size.
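
As referred to in the list above, one simple deviance-based variant can be computed directly from the fitted object; a sketch, using the model m1 from this section (other pseudo-R-squareds will give other values):

# McFadden-style pseudo-R-squared: the proportional reduction in
# deviance relative to the intercept-only model.
1 - m1$deviance / m1$null.deviance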

References

• Cameron, A. C. and Trivedi, P. K. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.

• Cameron, A. C. and Trivedi, P. K. 1998. Regression Analysis of Count Data. New York: Cambridge Press.

• Cameron, A. C. Advances in Count Data Regression. Talk for the Applied Statistics Workshop, March 28, 2009. http://cameron.econ.ucdavis.edu/racd/count.html.

• Dupont, W. D. 2002. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. New York: Cambridge Press.

• Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

• Long, J. S. and Freese, J. 2006. Regression Models for Categorical Dependent Variables Using Stata, Second Edition. College Station, TX: Stata Press.


PART II

Appendices


PART II .:. CHAPTER 7 .:.

Other Resources

• https://www.tutorialspoint.com/r/r_lists.htm

• http://www.r-tutor.com/r-introduction

• https://www.math.columbia.edu/~ik/tutor.pdf

• http://www.cyclismo.org/tutorial/R/time.html

• http://www.stats.uwo.ca/faculty/aim/2015/3859/R-scripts/PredictVolatility.html

• https://www.youtube.com/watch?v=vUVAaDqz4cs

• https://www.r-bloggers.com/r-and-finance/


PART II .:. CHAPTER 8 .:.

Levels of Measurement

Introduction

It is customary to refer to the theory of scales as having been developed by Stevens (1946). In that paper he argues that all measurement is done by assuming a certain scale type. He distinguished four different types of scale: nominal, ordinal, interval, and ratio scales.


8.1 Nominal Scale

The nominal scale is the simplest form of classification. It simply contains labels that do not even assume an order. Examples include asset classes, first names, countries, days of the month, weekdays, etc. It is not possible to use statistics such as the average or the median; the only thing that can be measured is which label occurs most often (the mode, also called the modus).

Scale Type                        Nominal
Characterization                  labels (e.g. asset classes, stock exchanges)
Permissible Statistics            mode (not median or average), chi-square
Permissible Scale Transformation  equality
Structure                         unordered set

Table 8.1: Characterization of the Nominal Scale of Measurement.

Note that it is possible to use numbers as labels, but that this is very misleading. When using a nominal scale, none of the traditional metrics (such as averages) can be used.
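
R has no built-in function for the mode of a nominal variable, but it is easily obtained from a frequency table; a minimal sketch with made-up asset classes:

# The modal category of a nominal variable via a frequency table:
asset.class <- factor(c("equity", "bond", "equity", "cash", "equity"))
table(asset.class)                     # frequency of each label
names(which.max(table(asset.class)))   # the mode: "equity"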


8.2 Ordinal Scale

This scale type assumes a certain order. An example is a set of labels such as very safe, moderate, risky, very risky. Bond ratings such as AAA, BB+, etc. are also ordinal scales: they indicate a certain order, but there is no way to determine whether the distance between, say, AAA and AA- is similar to the distance between BBB and BB-. It may make sense to talk about a median, but it does not make any sense to calculate an average (as is sometimes done in the industry and even in regulations).

Scale Type                        Ordinal Scale
Characterization                  ranked labels (e.g. ratings for bonds from rating agencies)
Permissible Statistics            median, percentile
Permissible Scale Transformation  order
Structure                         (strictly) ordered set

Table 8.2: Characterization of the Ordinal Scale of Measurement.

Ordinal labels can be replaced by others if the strict order is conserved (i.e. by a strictly increasing or decreasing function). For example, AAA, AA-, and BBB+ can be replaced by 1, 2, and 3, or even by -501, -500, and 500,000. The information content is the same, but the average will have no meaningful interpretation.
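
In R an ordinal variable is represented by an ordered factor; rank-based statistics remain meaningful, the mean does not. A small sketch with made-up ratings:

rating <- factor(c("AA", "BBB", "AAA", "BBB", "A"),
                 levels = c("BBB", "A", "AA", "AAA"), ordered = TRUE)
max(rating)                                  # order is understood: "AAA"
levels(rating)[median(as.integer(rating))]   # median via the ranks: "A"
mean(as.integer(rating))                     # computes a number, but it has
                                             # no meaningful interpretation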


8.3 Interval Scale

This scale can be used for many quantifiable variables, such as temperature in degrees Celsius. In this case, the difference between 1 and 2 degrees is the same as the difference between 100 and 101 degrees, and the average has a meaningful interpretation. Note that the zero point has only an arbitrary meaning, just like using a number for an ordinal scale: it can be used as a name, but it is only a name.

Scale Type                        Interval Scale
Characterization                  difference between labels is meaningful (e.g. the Celsius scale for temperature)
Permissible Statistics            mean, standard deviation, correlation, regression, analysis of variance
Permissible Scale Transformation  affine
Structure                         affine line

Table 8.3: Characterization of the Interval Scale of Measurement.

Rescaling is possible and remains meaningful. For example, a conversion from Celsius to Fahrenheit is possible via the formula Tf = (9/5) Tc + 32, with Tc the temperature in Celsius and Tf the temperature in Fahrenheit.

An affine transformation is a transformation of the form y = Ax + b. In Euclidean space an affine transformation preserves collinearity (points that lie on a line remain on a line) and ratios of distances along a line (for distinct collinear points p1, p2, p3, the ratio ||p2 - p1|| / ||p3 - p2|| is preserved).

In general, an affine transformation is composed of linear transformations (rotation, scaling and/or shear) and a translation (or "shift"). An affine transformation is an internal operation, and several linear transformations can be combined into one transformation.
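
As a small illustration in R, the Celsius-to-Fahrenheit conversion above is exactly such an affine map:

celsius    <- c(-40, 0, 37, 100)
fahrenheit <- 9 / 5 * celsius + 32   # y = A x + b with A = 9/5 and b = 32
fahrenheit                           # -40.0  32.0  98.6  212.0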


8.4 Ratio Scale

Using the Kelvin scale for temperature allows us to use a ratio scale: here not only the distances between the degrees but also the zero point is meaningful. Among the many examples are profit, loss, value, price, etc. A coherent risk measure is also on a ratio scale, because its property of translational invariance implies the existence of a true zero point.

Scale Type                        Ratio Scale
Characterization                  a true zero point exists (e.g. VAR, VaR, ES)
Permissible Statistics            geometric mean, harmonic mean, coefficient of variation, logarithms, etc.
Permissible Scale Transformation  multiplication
Structure                         field

Table 8.4: Characterization of the Ratio Scale of Measurement.
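
Ratio-scale statistics such as the geometric mean are straightforward in R; a minimal sketch with hypothetical gross returns:

gross.returns <- c(1.05, 0.98, 1.12)   # made-up growth factors
exp(mean(log(gross.returns)))          # geometric mean: the average growth
                                       # factor per period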



Bibliography

De Brouwer, P. J. S. (2012). Maslowian Portfolio Theory, a Coherent Approach to Strategic Asset Allocation. Brussels: VUBPress.

Gini, C. (1912). Variabilita e mutabilita.

Hara, A. and Y. Hayashi (2012). Ensemble neural network rule extraction using the Re-RX algorithm. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–6. IEEE.

Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Springer.

Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation 17(6), 1223–1263.

Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting. Neural Computation 9(1), 205–225.

Setiono, R., B. Baesens, and C. Mues (2008). Recursive neural network rule extraction for data with mixed attributes. IEEE Transactions on Neural Networks 19(2), 299–307.

Stevens, S. S. (1946). On the theory of scales of measurement. Science 103(2684), 677–680.

Tickle, A. B., R. Andrews, M. Golea, and J. Diederich (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks 9(6), 1057–1068.



Nomenclature

x̄ mean, page 92

Cα(T) cost of complexity function for the tree T and pruning parameter α, page 133

f() function, page 130

h() hypothesis, page 130

log() ln or the natural logarithm, page 136

xk an observation of the stochastic variable X, page 92

ANN artificial neural network, page 160

aov analysis of variance, page 126

ARIMA autoregressive integrated moving average model, page 84

ARMA autoregressive moving average model, page 84

AUC area under curve, page 115

CART Classification and Regression Tree, page 135

cdf cumulative density function, page 71

f(.) probability density function, page 92

FN false negative, page 117

FNR False Negative Rate, page 117

FP false positive, page 117

FPR False Positive Rate, page 117

IDE integrated development environment, page 15


KS Kolmogorov Smirnov, page 120

MAD Median Absolute Deviation, page 96

MAD mean average deviation, page 115

MSE mean square error, page 113

Nasdaq National Association of Securities Dealers Automated Quotations, page 186

NN neural network, page 160

NYSE New York Stock Exchange, page 186

OLS ordinary least squares, pages 130 and 205

OO Object Oriented, page 22

P(.) probability, page 92

pdf probability density function, page 71

ROC Receiver Operating Characteristic, page 117

SS sum of squares, page 115

STL seasonal trend decomposition using Loess, page 86

TN true negative, page 117

TNR True Negative Rate, page 117

TP true positive, page 117

TPR True Positive Rate, page 117
