Statistical lingua franca Introduction to R Lecture 2 22 Sept 2015 Kyrylo Bessonov 1


Statistical lingua franca

Introduction to R

Lecture 2, 22 Sept 2015

Kyrylo Bessonov

Outline

1. Introduction to R language
   1. basics
   2. loops
   3. data structures
2. Visualization
   1. rich plotting options
3. Bioinformatics
   1. repositories
   2. annotation of biological IDs

Definition

• “R is a free software environment for statistical computing and graphics” [1]
• R is one of the most widely used languages among statisticians, data miners and bioinformaticians.
• R is a free implementation of the S language
• Commercial statistical packages include SPSS, SAS and MATLAB

[1] R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/)

History

• 1993: created by Ross Ihaka and Robert Gentleman at the University of Auckland
• Extended by the community via
  – Packages
• Facts
  – initial interpreter ~1000 lines of C code
  – GNU project

R language features

• Based on the S language (commercial)
  – Created in 1976 at Bell Labs
  – Main motto: "to turn ideas into software, quickly and faithfully" (John Chambers)
• Memory allocation
  – Fixed allocation at startup
  – ‘on-the-fly’ garbage collector
  – memory hungry
• Variables have lexical scoping
  – functions have access to the variables that were in effect when the function was defined

Lexical scoping

make.counter <- function(x = 0) {
  function() {
    x <<- x + 1
    cat("Inside the make.counter() the x is", x, "and not 5!\n")
    print(environment())
  }
}
x <- 5
cat("Global value of x is", x, "\n"); print(environment())

counter <- make.counter()
counter()
print(x)

1) The value of x inside make.counter() is kept between calls and belongs to the function environment (i.e. scope)
2) The global value of x is not modified
3) The “inner function” (i.e. function()) finds the local x of make.counter() first, so it never reaches the global x

More details are given in the functions section of the tutorial…

The value of a variable is searched for first in the environment of the function itself. If it is not found there, the search continues in the parent (enclosing) environment

R language features

• Flexible plot layouts
  – The layout() command divides the plotting area into a matrix.
    nf <- layout(matrix(c(1,1,0,2), 2, 2, byrow=TRUE), respect=TRUE)
  – Calling layout.show(nf) displays the layout for your reference.

R language features

• A flexible statistical analysis toolkit
  – Statistical models
    • Regression, ANOVA, GLM, trees
• Free data analysis
  – No subscription fees
  – Transparent code
• Is a language
  – Can write complex scripts
  – Not limited by the GUI
• Active community

Why learn R?

• R is widely used by bioinformaticians and statisticians
• It is multiplatform – Mac OS X, Windows, Linux
• Large number of libraries
• Main library repositories: CRAN and Bioconductor

In a nutshell…

• R is a scripting language and, as such, is much easier to learn than compiled languages such as C

• R has reasonably well written documentation (vignettes)

• Syntax in R is simple and intuitive if one has basic statistics skills

• R scripts will be provided and explained in-class

Installation of R

1. go to http://www.r-project.org/
2. select CRAN (Comprehensive R Archive Network) from left menu
3. link to nearby geographic site
4. select your operating system
5. choose "Base" installation
6. save R-X.X.X-win32.exe (Windows) or R-X.X.X-mini.dmg (Mac OS X)
7. run the installation program accepting defaults

Tutorial: Introduction to R language

Part 1: Basics

Topics covered in this tutorial

• Operators / Variables
• Main object types
• Visualization
  – plot modification functions
• Writing and reading data to/from files
• Bioinformatics applications
  – Annotation of array probes

Variables/Operators

• Variables store one element
  x <- 25
  Here the variable x is assigned the value 25
• Check the value assigned to the variable x
  > x
  [1] 25
• Basic mathematical operators:
  – arithmetic: + , - , * , /
  – power: ^
• Use parentheses to obtain the desired order of mathematical operations

Arithmetic operators

• What is the value of small z here?
  > x <- 25
  > y <- 15
  > z <- (x + y)*2
  > Z <- z*z
  > z
  [1] 80

Logical operators

• These operators mostly work on vectors, matrices and other data types
• Type of data is not important, the same operators are used for numeric and character data types

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x | y      x OR y
x & y      x AND y

Logical operators

• Can be applied to vectors in the following way. The return value is TRUE or FALSE for each element
  > v1
  [1] 48  2  3  4  5
  > v1 <= 3
  [1] FALSE  TRUE  TRUE FALSE FALSE
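A minimal sketch of how these operators combine on a vector. It reuses the vector v1 from this slide; the names small, in.range, outside and picked are only illustrative.

```r
v1 <- c(48, 2, 3, 4, 5)

# Comparison operators work element by element and return a logical vector
small <- v1 <= 3

# & and | combine two logical vectors element by element
in.range <- v1 > 2 & v1 < 10
outside  <- v1 <= 2 | v1 >= 10

# A logical vector can be used to select the elements that are TRUE
picked <- v1[in.range]

print(small)
print(picked)
```

Selecting with a logical vector, as in v1[in.range], is the usual way to filter data in R without writing a loop.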

Key functions

• mathematical functions
  – sqrt, log, exp, sin, cos, tan
• simple functions
  – max, min, length, sum, mean, var, sort
• other useful functions
  – abs(-5)      # absolute value
  – exp(8)       # exponentiation
  – log(exp(8))  # natural logarithm
  – sqrt(64)     # square root

R workspace

• Display all workspace objects (variables, vectors, etc.) via ls():
  > ls()
  [1] "Z"  "v1" "x"  "y"  "z"
• Useful tip: to save the workspace to a file and restore it later use:
  > save.image(file = "workplace.rda")
  > load(file = "workplace.rda")

How to find help info?

• Any function in R has help information
• To invoke help use the ? operator or help():
  ?function_name
  ?mean
  help(mean, try.all.packages=TRUE)
• To search in all packages installed in your R installation use try.all.packages=TRUE in help()
• To search for a keyword in the R documentation use help.search():
  help.search("mean")

Basic data types

• Data can be of 3 basic data types:
  – numeric
  – character
  – logical
• Numeric variable type / class:
  > x <- 1
  > mode(x)
  [1] "numeric"
  > class(x)
  [1] "numeric"

Basic data types

• Logical variable type (TRUE/FALSE):
  > y <- 3<4
  > mode(y)
  [1] "logical"
• Character variable type:
  > z <- "Hello class"
  > mode(z)
  [1] "character"

Objects/Data structures

• The main data objects in R are:
  – Matrices (single data type)
  – Data frames (support various data types)
  – Lists (contain sets of vectors)
  – Other more complex objects with slots
    • S3 and S4 objects
• Matrices are 2D objects (rows/columns)
  > m <- matrix(0,2,3)
  > m
       [,1] [,2] [,3]
  [1,]    0    0    0
  [2,]    0    0    0

Vectors c()

• Vectors have only 1 dimension and represent an ordered sequence of data. They can also store variables
  > v1 <- c(1, 2, 3, 4, 5)
  > mean(v1)
  [1] 3
• The elements of a vector are accessed/modified with square brackets (e.g. [number])
  > v1[1] <- 48
  > v1
  [1] 48  2  3  4  5

Lists

• Lists contain various vectors. Each vector in the list can be accessed with double square brackets [[number]]. The elements of a list can be of different types
  > x <- c(1, 2, 3, 4)
  > y <- c(2, 3, 4)
  > L1 <- list(x, y)
  > L1
  [[1]]
  [1] 1 2 3 4

  [[2]]
  [1] 2 3 4

Matrices

• Contain data of the same type
  – i.e. all cells are one of the 3 basic types
• Defined with matrix()
  – Arguments:
    • initialization value (e.g. 0)
    • ncol: number of columns
    • nrow: number of rows
  > matrix(0,ncol=2,nrow=2)
       [,1] [,2]
  [1,]    0    0
  [2,]    0    0

Data frames

• Data frames are similar to matrices but can contain various data types
  > x <- c(1,5,10)
  > y <- c("A", "B", "C")
  > z <- data.frame(x,y)
  > z
     x y
  1  1 A
  2  5 B
  3 10 C
• To get/change column and row names use colnames() and rownames()

Factors

• Factors are special
  – could be vectors of integers or characters
  – can identify levels
    • i.e. unique values in a vector
> letters = c("A","B","C","A","C","C")
- the vector letters has 6 chars but 3 levels
> letters = factor(letters)
> letters
[1] A B C A C C
Levels: A B C
> summary(letters)
A B C
2 1 3
- summary() shows
  - the unique values (levels)
  - the number of elements per level

Conversion between data types

• One can convert one type of data into another using as.xxx where xxx is a data type
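A short sketch of the as.xxx family in use. The input values are made up for illustration; the functions themselves (as.numeric, as.character, as.logical) are part of base R.

```r
x <- c("1", "2.5", "10")           # character vector that holds numbers
n <- as.numeric(x)                 # character -> numeric
s <- as.character(n)               # numeric -> character
b <- as.logical(c(0, 1, 2))        # numeric -> logical: 0 is FALSE, non-zero is TRUE

# Factors need care: as.numeric() on a factor returns the internal level codes
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # level codes, not the labels
values <- as.numeric(as.character(f))   # go through character to recover the labels

print(n); print(b); print(codes); print(values)
```

Going through as.character() first is the usual idiom when a factor stores numbers as labels.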

S4 objects

• Allow storing more complex data structures
• Benefits
  – Provide greater modularity
  – Check for data type errors during assignment
  – Cleaner code
• Every S4 object belongs to a class

require(methods)
setClass("Box",
  slots = c(box_name = "character", address = "numeric", files = "list"),
  # prototype values must match the declared slot types
  prototype = list(box_name = character(0), address = numeric(0), files = list())
)

S4 objects

• Each object is an instance of a class
  object = new("Box")
• To access slots in the object use the @ operator
  object@box_name
  object@address
  object@files
• Assign values to slots
  object@box_name = "pretty_box"

S4 object functions

• Use functions to get and set values in objects
  – a generic must exist before setMethod() can attach a method to it

setGeneric("getBoxName", function(object) standardGeneric("getBoxName"))
setMethod(f="getBoxName", signature="Box", definition=function(object){
  return(object@box_name)
})

setGeneric("setBoxName", function(object, name) standardGeneric("setBoxName"))
setMethod(f="setBoxName", signature="Box", definition=function(object, name){
  object@box_name <- name
  return(object)   # return the modified copy
})
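Putting the S4 pieces together, a self-contained sketch might look as follows. It reuses the Box class and the getBoxName/setBoxName names from the slides; the slot values and the final type-check demonstration are assumptions added for illustration.

```r
library(methods)

# Class definition with typed slots
setClass("Box",
  slots = c(box_name = "character", address = "numeric", files = "list"))

# Generics are required before methods can be registered
setGeneric("getBoxName", function(object) standardGeneric("getBoxName"))
setMethod("getBoxName", "Box", function(object) object@box_name)

setGeneric("setBoxName", function(object, name) standardGeneric("setBoxName"))
setMethod("setBoxName", "Box", function(object, name) {
  object@box_name <- name
  object   # arguments are passed by value, so the modified copy is returned
})

box <- new("Box", box_name = "plain_box", address = 1)
box <- setBoxName(box, "pretty_box")   # reassign to keep the change
print(getBoxName(box))

# Slot types are checked on assignment: a number is rejected for a character slot
res <- tryCatch({ box@box_name <- 42; "accepted" },
                error = function(e) "rejected")
print(res)
```

Because R copies the object inside the setter, the caller has to reassign the result; calling setBoxName(box, "pretty_box") without the assignment leaves box unchanged.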

Read Input / Write Output

• To read data into R from a text file use read.table()
  – read help(read.table) to learn more
  – scan() is a more flexible alternative
  raw_data <- read.table(file="data_file.txt")
• To write data from R to a text file use write.table()
  > write.table(mydata, "data_file.txt")
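A round trip through a file can be sketched as below. The data frame contents and the temporary file are assumptions made so the example runs anywhere; the arguments shown (sep, quote, row.names, header) are standard options of write.table() and read.table().

```r
mydata <- data.frame(id = 1:3, label = c("A", "B", "C"))
path <- tempfile(fileext = ".txt")   # temporary file, used only for this demo

# row.names = FALSE avoids writing an extra column of row names
write.table(mydata, path, sep = "\t", quote = FALSE, row.names = FALSE)

# header = TRUE tells read.table() that the first line holds the column names
raw_data <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(raw_data)

unlink(path)   # clean up the temporary file
```

Writing and reading with the same separator and header settings keeps the column names and types intact.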

Loops

• Loops repeat operations
  – can check a given condition
  – execute n number of times
• R is not specifically designed for loops
  – use them sparingly or replace them by apply()
  – avoid nested loops
• Will look at
  – for loop
  – while loop
  – repeat loop

For loop

• Executed n number of times
  – safe to use
  – variable i takes each value found in the vector (e.g. 1:10)

for(i in 1:10){
  print(i)
}

while loop

• Tests a condition before each iteration and repeats the loop body as long as the condition is TRUE

i=1;
while(i<=10){
  print(i)
  i=i+1
}

repeat

• similar to the while loop
• will always begin the loop
  – Executed at least once
• Important to have a breaking condition
  – otherwise, infinite loop

x <- 1
repeat {
  x <- x+1
  if(x >= 5) break;   # breaking condition
}
print(x)

apply()

• A compact alternative to explicit loops
• Apply a function to 2D data structures (matrix, df)
  apply(X, MARGIN, FUN, ...)
  X: matrix or data frame (df)
  MARGIN: 1 = rows and 2 = columns
  FUN: function to apply to rows or columns

M <- matrix(1:6, nrow=3, byrow=TRUE)
apply(M, 1, sum) #rows
apply(M, 2, sum) #columns
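To make the meaning of MARGIN concrete, here is a small sketch on the same matrix M (MARGIN = 1 works over rows, MARGIN = 2 over columns). The variable names row.sums and col.sums are illustrative.

```r
M <- matrix(1:6, nrow = 3, byrow = TRUE)
print(M)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

row.sums <- apply(M, 1, sum)   # one value per row:    3 7 11
col.sums <- apply(M, 2, sum)   # one value per column: 9 12

# For sums specifically, base R also offers rowSums() and colSums()
same.rows <- all(row.sums == rowSums(M))
same.cols <- all(col.sums == colSums(M))
```

For plain sums and means, rowSums(), colSums(), rowMeans() and colMeans() are usually preferable; apply() is the general tool for any other function.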

lapply()

• is applied to lists / vectors
• it does not work on matrices or higher-dimensional arrays directly
• returns list()

M <- matrix(1:6, nrow=3, byrow=TRUE)
lapply(1:2, function(x){sum(M[,x])})

L=list(sample(1:10,10))
lapply(L,mean)

sapply()

• the "s" in "sapply" stands for "simplify"
• simplifies output to a vector

M <- matrix(1:6, nrow=3, byrow=TRUE)
sapply(1:2, function(x){sum(M[,x])})
[1]  9 12

tapply()

• Apply a function to subsets of a variable defined by a grouping factor
  tapply(summary, group, Function)
  Summary variable: selected variable to apply the function on
  Group variable: the filter variable used to split the data
  Function: function to apply to the resulting sets

medical.data <- data.frame(patient = 1:100,
                           age = rnorm(100, mean = 60, sd = 12),
                           treatment = gl(2, 50, labels = c("Treatment", "Control")))
tapply(medical.data$age, medical.data$treatment, mean)
Treatment   Control
 61.10559  59.37849

Conditional statements

• allow making choices
  – only specific parts of the code are executed
• Add flexibility
• Main statements
  – if … else
  – switch

If..else

if (test expression) {
  #code
} else if (test expression) {
  #code
} else {
  #code
}

• else does not have any test expression
  – accepts all cases not covered by if() or else if()
• test expression returns a logical value (T or F)

If..else

sign <- function(x){
  if(x > 0){
    print("x is a positive number");
  } else if (x == 0){
    print("x is zero");
  } else{
    print("x is negative")
  }
}
x<- 5; sign(x);
x<- 0; sign(x);
x<- -2; sign(x);

switch()

switch (statement, list)
• the statement is evaluated and its value returned
• based on this value, the corresponding item in the list is returned

y <- rnorm(5)
x <- "sd"
z <- switch(x, "mean"=mean(y), "median"=median(y), "variance"=var(y), "sd"=sd(y))
print(z)
sd(y)

Tutorial: Introduction to R language

Part 2: Functions

Functions in R

• Functions encapsulate chunks of code
  – Helps with debugging and code readability
  – Arguments are passed by value

add.function <- function(x){
  x+5
}
> add.function(2)
[1] 7

Functions in R

• defined with the function() directive
• stored as R objects
• can be nested
• The return value is the last expression evaluated in the function body, or the argument of an explicit return() call

f <- function() {
  # Do something interesting
  return(1)
}

Arguments

• named arguments
  – optionally could have default values
• not every function call in R makes use of all arguments
  – i.e. function arguments could be missing
    • then the default values are used

mydata <- sample(1:10, 20, replace=T)
sort(mydata)
sort(mydata, decreasing=T)

Arguments

• can be matched by
  – relative sequential position: f(arg1, arg2, arg3)
    rnorm(100,0,1)
  – name: f(data=arg1, replace=arg2)
    rnorm(n=100, mean = 0, sd = 1)
  – can mix positional and by name matching
    mydata=data.frame(y=rnorm(100),x=rnorm(100))
    lm(data = mydata, y ~ x, model = FALSE, 1:100)
    #equivalently
    lm(y ~ x, mydata, 1:100, model = FALSE)

Lazy evaluation

• Arguments to functions are evaluated lazily
  – i.e. R does not check that every argument is supplied to the function
  – i.e. can give fewer but not more args to a function
  – args are evaluated as needed

f <- function(a,b) {
  a^2
}
f(a=10) #no error even if b is missing
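The consequences of lazy evaluation can be sketched as follows. The function g() and the variable names are hypothetical additions; f() is the function from the slide.

```r
f <- function(a, b) {
  a^2        # b is never used, so it is never evaluated
}
g <- function(a, b) {
  a^2 + b    # b is used here, so it must be supplied
}

# The expression given for b is never run because f() does not touch b
ok <- f(a = 10, b = stop("never evaluated"))

# A missing argument only fails at the point where it is actually needed
msg <- tryCatch(g(a = 10), error = function(e) conditionMessage(e))

# Supplying more arguments than the function declares is always an error
extra <- tryCatch(f(10, 20, 30), error = function(e) "unused argument")

print(ok); print(msg); print(extra)
```

An argument is evaluated only when the function body first uses it, which is why f() tolerates both a missing b and a b that would raise an error.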

The … argument

• The ... argument indicates
  – a variable # of arguments
  – adds flexibility and abstraction
• Useful when the number of arguments to be passed to a function is not known a priori

x=1:10; y=rnorm(10)
f <- function(x, y, ...) {
  cat("value of x:",x,"\n");
  cat("value of y:",y,"\n");
  plot(x,y, ...)
}
f(x,y)
#adding extra argument type="l"
f(x,y, type="l")

Scoping

• Variable value is searched through a series of environments
  – the global environment
  – the namespaces of loaded libraries
  search()
  [1] ".GlobalEnv"        "package:stats"     "package:graphics"
  [4] "package:grDevices" "package:utils"     "package:datasets"
  [7] "package:methods"   "Autoloads"         "package:base"
• The global environment
  – is the user’s workspace

Scoping

• Loading a package via library() adds the namespace of that package to position 2 of the search list

library(igraph)
search()
 [1] ".GlobalEnv"        "package:igraph"    "package:stats"
 [4] "package:graphics"  "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"   "Autoloads"
[10] "package:base"

Scoping

• The scoping rules for R
  – different from the original S language
  – uses lexical scoping
    • related to how R “searches” for the variable value
  – simplifies statistical calculations

f <- function(x, y) {
  y=2
  x+z
}
– z is a ‘free variable’ not defined in f()

Searching for the value

• free variable value will be searched
  – in the environment in which the function was defined
  – in the parent environment
  – in the top-level environment (i.e. global)
  – finally in the empty environment, where the search ends
• if value not found
  – error thrown

Scoping

• Typically, the user defined values are found
  – in the workspace (i.e. global environment)
  – a function is defined in the global environment
• can have functions
  – defined inside other functions
  – the environment in which a function is defined could be the environment of another function
    • child/parent environments

build.power <- function(n) {
  p <- function(x) {
    x^n
  }
  print(p)
}
square <- build.power(2)
cube <- build.power(3)
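A brief sketch of how the functions built this way behave. It uses a simplified build.power() that returns the inner function directly instead of printing it, which is an illustrative variation on the slide's code.

```r
build.power <- function(n) {
  function(x) x^n     # n is a free variable, found in build.power()'s environment
}

square <- build.power(2)
cube   <- build.power(3)

s <- square(3)   # 3^2
k <- cube(2)     # 2^3

# Each function keeps its own enclosing environment with its own n
n.square <- get("n", envir = environment(square))
n.cube   <- get("n", envir = environment(cube))

print(c(s, k, n.square, n.cube))
```

Each call to build.power() creates a new environment, so square and cube remember different values of n even though they share the same code.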

Lexical Scoping

• With lexical scoping
  – the variable value is looked up in the environment in which the function was defined

y <- 10
f <- function(x) {
  y <- 2
  y^2
}
f()
[1] 4

Other languages

• Lexical scoping is supported in
  – Perl
  – Python
  – …

Dynamic scoping

• With dynamic scoping
  – The variable value is looked up in the environment from which the function was called (the calling environment)

y <- 10
f <- function(x) {
  y^2
}
f()
[1] 100
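The difference between the two rules only shows up when the calling environment holds its own value of y. The sketch below, with a hypothetical wrapper g(), shows what R (lexically scoped) actually does in that case.

```r
y <- 10
f <- function() y^2    # y is a free variable in f()

g <- function() {
  y <- 2               # local to g(); f() was not defined here
  f()
}

result <- g()
print(result)
# R uses lexical scoping: f() looks y up where f() was defined, so y = 10 and
# the result is 100. Under dynamic scoping the caller's y = 2 would be used
# and the result would be 4.
```

Calling f() directly from the top level gives the same answer under both rules, which is why a wrapper such as g() is needed to tell them apart.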

Regression and Generalized Linear Models Functions

Examples

Regression

• Regression analysis is used to
  – explain or model the relationship between
    • a single variable Y, called the response, output or dependent variable, and
    • one or more predictor, input, independent or explanatory variables, X1, …, Xp
  – p=1 simple regression
  – p>1 multiple regression

Main uses of regression analysis

• Prediction of future observations.
• Assessment of the effect of, or relationship between, explanatory variables on the response.
• A general description of data structure

Regression in R

• the basic syntax for doing regression in R
  – lm(Y ~ model) to fit linear models
  – glm(Y ~ model) to fit generalized linear models.
• where model could be specified following the general syntax rules in R
  – consists of predictor terms
  – provides Y~X relationship

Model syntax

Simple linear regression

• The regression of Y on X is given by
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$  for i = 1, …, n
• Unknown parameters
  – intercept $\beta_0$
    • point at which the line intercepts the y-axis
  – slope or coefficient $\beta_1$
    • increase in Y per unit change in X

Objective

• To find the equation of the line that “best” fits the data.
  – find $\hat\beta_0$ and $\hat\beta_1$ such that the fitted values $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ are as close as possible to the observed $y_i$
  – Residuals
    • The difference between the observed value $y_i$ and the fitted value $\hat y_i$

Residual sum of squares (RSS)

• A usual way of calculating $\hat\beta_0$ and $\hat\beta_1$
  – based on the minimization of the sum of the squared residuals, or
    • residual sum of squares (RSS)
  $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$
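Minimizing the RSS for a single predictor has a well-known closed-form solution, which can be checked against lm(). The sketch below uses a small made-up data set (x and y are not from the lecture) and computes the slope, intercept and RSS by hand.

```r
# Made-up data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

fitted.by.hand <- b0 + b1 * x
rss <- sum((y - fitted.by.hand)^2)    # residual sum of squares

# lm() should give the same coefficients and the same RSS
fit <- lm(y ~ x)
print(c(b0, b1))
print(coef(fit))
print(c(rss, sum(resid(fit)^2)))
```

The hand calculation and lm() agree up to floating-point rounding, since lm() minimizes the same quantity.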

Regression example

• thuesen dataset
  – Ventricular shortening velocity
  – 24 rows and 2 columns
    • short.velocity (Y)
    • blood.glucose (X)
  – It contains ventricular shortening velocity and blood glucose for type 1 diabetic patients.

Regression example

file="http://www.montefiore.ulg.ac.be/~kbessonov/present_data/GBIO0009-1_TopInBioinf2015-16/lectures/L2/thuesen.txt"
data <- read.table(file, header=TRUE, stringsAsFactors=FALSE)
options(na.action = na.exclude)
fit.lm <- lm(short.velocity ~ blood.glucose, data=data)
data.frame(data, fitted.value=fitted(fit.lm), residual=resid(fit.lm))

  blood.glucose short.velocity fitted.value     residual
1          15.3           1.76     1.433841  0.326158532
2          10.8           1.34     1.335010  0.004989882
3           8.1           1.27     1.275711 -0.005711308
4          19.5           1.47     1.526084 -0.056084062
5           7.2           1.27     1.255945  0.014054962

Measuring Goodness of Fit

• Coefficient of Determination, R²
  – Need to calculate TSS, RSS and SSreg
• The ANOVA breaks the total variability observed in the sample into two parts
  – TSS = SSreg + RSS
  – TSS: total sum of squares (entire sample variability)
  – SSreg: regression sum of squares
    • Explained by fitted model
  – RSS: residual sum of squares (unexplained variability)

Calculating R²

anova(fit.lm)
Analysis of Variance Table

Response: short.velocity
              Df  Sum Sq  Mean Sq F value Pr(>F)
blood.glucose  1 0.20727 0.207269   4.414 0.0479 *
Residuals     21 0.98610 0.046957
---

R² = SSreg/TSS
R² = 0.20727/(0.98610+0.20727)
R² = 0.1736846
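The same calculation can be done programmatically from the ANOVA table. As the thuesen file may not be reachable, this sketch uses the built-in cars data set as a stand-in (an assumption, not the lecture's data); the steps carry over unchanged to fit.lm.

```r
# Stand-in model on a data set shipped with R
fit <- lm(dist ~ speed, data = cars)

tab <- anova(fit)
ss.reg <- tab[["Sum Sq"]][1]   # regression sum of squares (first row)
rss    <- tab[["Sum Sq"]][2]   # residual sum of squares (Residuals row)
tss    <- ss.reg + rss         # TSS = SSreg + RSS

r2 <- ss.reg / tss
print(r2)

# summary() reports the same quantity as "Multiple R-squared"
print(summary(fit)$r.squared)
```

Reading the sums of squares from the table avoids retyping rounded numbers and gives R² at full precision.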

Getting lm() summary

summary(fit.lm)

Call:
lm(formula = short.velocity ~ blood.glucose, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40141 -0.14760 -0.02202  0.03001  0.43490

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.09781    0.11748   9.345 6.26e-09 ***
blood.glucose  0.02196    0.01045   2.101   0.0479 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2167 on 21 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1737, Adjusted R-squared:  0.1343
F-statistic: 4.414 on 1 and 21 DF,  p-value: 0.0479

* - shorthand for significance levels
Estimate - the estimated value of the coefficient (intercept or slope) calculated by the regression
Std. Error - measure of the variability in the estimate of the coefficient β
t value - measures whether or not the coefficient for this variable is meaningful for the model. It is used to calculate the significance levels.
R-squared - metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best.
DF - the degrees of freedom are the number of observations used in the fit minus the number of estimated coefficients (the intercept counts as one). Of the 24 rows, 1 is dropped because of a missing value, so DF = 23 - 2 = 21

Generalized Linear Models Functions

• The glm()
  – is designed to fit generalized linear regression models on outcome data that is
    • binary
    • count
    • proportion
    • continuous
  – overcomes a limitation of classical regression models
    • no need to transform the response Y to have a normal distribution N(0, σ²)
  – limitations
    • assumes linear Y~X relationship

glm()

• glm(formula, family = "gaussian", data, …)
  – formula: symbolic description of the model to fit
    • e.g. y~x1+x2 or y~. or y~x1+x2+x1*x2
  – family: type of distribution to apply to the response variable
  – data: data frame with the variables
• glm returns an object (a list) with named components
• summary(obj) provides a compact summary

Logistic regression example

Using data on admissions, where a rank of 1 represents the highest and 4 the lowest prestige, determine the major variable/factor impacting the admission decision.

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
head(mydata)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
fit <- glm(admit ~ gre + gpa + rank, family=binomial(link="logit"), data=mydata)
summary(fit)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.449548   1.132846  -3.045  0.00233 **
gre          0.002294   0.001092   2.101  0.03564 *
gpa          0.777014   0.327484   2.373  0.01766 *
rank        -0.560031   0.127137  -4.405 1.06e-05 ***
AIC: 467.44 (the lower the better)
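Logistic regression coefficients are on the log-odds scale, so they are often easier to read after exponentiation. The sketch below starts from the estimates printed above (typed in by hand, so no download is needed); the example applicant is hypothetical.

```r
# Estimates copied from the summary(fit) output above
coefs <- c("(Intercept)" = -3.449548, gre = 0.002294, gpa = 0.777014, rank = -0.560031)

# exp(coefficient) = multiplicative change in the odds of admission
# for a one-unit increase in the predictor
odds.ratios <- exp(coefs[-1])
print(round(odds.ratios, 3))

# Predicted admission probability for a hypothetical applicant:
# gre = 600, gpa = 3.5, rank = 1 (the leading 1 multiplies the intercept)
applicant <- c(1, 600, 3.5, 1)
log.odds <- sum(coefs * applicant)
p.admit <- plogis(log.odds)      # inverse logit: 1 / (1 + exp(-log.odds))
print(p.admit)
```

With a fitted model object at hand, exp(coef(fit)) and predict(fit, newdata, type = "response") give the same quantities directly.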

Exploring glm() object

attributes(fit)
$names
 [1] "coefficients"      "residuals"         "fitted.values"
 [4] "effects"           "R"                 "rank"
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"
[13] "iter"              "weights"           "prior.weights"
[16] "df.residual"       "df.null"           "y"
[19] "converged"         "boundary"          "model"
[22] "call"              "formula"           "terms"
[25] "data"              "offset"            "control"
[28] "method"            "contrasts"         "xlevels"

$class
[1] "glm" "lm"

Accessing glm S3 object values

• Can look at the beta coefficients of each variable
  fit$coefficients
• Can check how good the fitted model is
  fit$deviance
  fit$aic
• Use the TAB key for auto-complete

Regression in Genetics

• Response yj for j=1..n
  • where case (1) or control (0)
  • Could refer to two types of patients or phenotypes
• Genotypes gi for markers i=1..p
• To run glm() need to code data
  – Additive
  – Dominant
  – Recessive

Data

• There are genotypes for
  – 1018 individuals at 32 SNP markers
  – 32 columns give the marker genotypes
    • AA-11, AG-12 and GG-22, with NA for missing
  – The genotypes
    • 890kb region flanking the CYP2D6 gene
    • associated with the metabolism of drugs

file="http://www.well.ox.ac.uk/rmott/LECTURES/LOGISTIC_REGRESSION/ugeno.dat"
data <- read.table(file, header=TRUE, stringsAsFactors=FALSE)

Coding

• Additive coding
  Code genotypes as
  – AA(11) x=0,
  – AG(12) x=1,
  – GG(22) x=2
• Recessive coding
  Code genotypes as
  – AA x=0,
  – AG x=0,
  – GG x=1
• Dominant coding
  Code genotypes as
  – AA x=0,
  – AG x=1,
  – GG x=1

additive <- function( x ) {
  return(as.numeric(factor(x))-1)
}

recessive <- function( x ) {
  return ( ifelse( additive(x) > 1, 1, 0 ) )
}

dominant <- function( x ) {
  return ( ifelse( additive(x) > 0, 1, 0 ) )
}

These functions convert a genotype vector into the chosen coding
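To see what the three codings produce, the functions can be applied to a short made-up genotype vector. The vector g and the helper additive.fixed() are illustrative assumptions, not part of the lecture code.

```r
additive  <- function(x) as.numeric(factor(x)) - 1
recessive <- function(x) ifelse(additive(x) > 1, 1, 0)
dominant  <- function(x) ifelse(additive(x) > 0, 1, 0)

g <- c(11, 12, 22, NA, 12)    # made-up genotypes: AA, AG, GG, missing, AG

coded <- rbind(additive  = additive(g),
               recessive = recessive(g),
               dominant  = dominant(g))
print(coded)    # missing genotypes stay NA in every coding

# Caveat: factor(x) numbers only the genotypes that are present, so a marker
# without any AA individuals would be coded starting from 0 at AG.
# Fixing the levels explicitly is one possible safeguard:
additive.fixed <- function(x) as.numeric(factor(x, levels = c(11, 12, 22))) - 1
shifted <- additive(c(12, 22))         # 0 1
stable  <- additive.fixed(c(12, 22))   # 1 2
```

Whether the safeguard matters depends on the data: it only changes the result for markers where one of the three genotypes is absent.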

Fitting the additive model

• Additive coding
x=apply(data[,-c(1:2)],2,additive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.add <- glm(y~m1, data=data_prep, family = 'binomial' )

Call:  glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
    -2.8695      -0.9872

Degrees of Freedom: 1017 Total (i.e. Null);  1016 Residual
Null Deviance:      343.7
Residual Deviance: 335.1        AIC: 339.1

Additive model example

Fitting the dominant model

• Dominant coding
x=apply(data[,-c(1:2)],2,dominant)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.dom <- glm(y~m1, data=data_prep, family = 'binomial' )

Call:  glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
     -2.880       -1.004

Degrees of Freedom: 1017 Total (i.e. Null);  1016 Residual
Null Deviance:      343.7
Residual Deviance: 336.2        AIC: 340.2

Dominant model example

86

Fitting the recessive model

• Recessive coding
x=apply(data[,-c(1:2)],2,recessive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.recessive <- glm(y~m1, data=data_prep, family = 'binomial' )

Call: glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
     -3.126      -15.440

Degrees of Freedom: 1017 Total (i.e. Null); 1016 Residual
Null Deviance: 343.7
Residual Deviance: 340.1    AIC: 344.1
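The three fits can be compared by AIC (lower is better). The sketch below only collects the AIC values printed on the slides. With the fitted objects available, `AIC(fit.add, fit.dom, fit.recessive)` would give the same comparison. The very large negative recessive coefficient (-15.44) is the kind of estimate that often appears when one genotype class contains no cases, so a cross-tabulation of genotype against y may be worth checking. That reading is an assumption, not something the slides state.

```r
# AIC values copied from the three printed glm outputs
aic <- c(additive = 339.1, dominant = 340.2, recessive = 344.1)

ranked <- sort(aic)            # smallest AIC first
best   <- names(ranked)[1]     # name of the best-ranked coding
delta  <- ranked - ranked[1]   # AIC difference to the best model

print(round(delta, 1))
print(best)
```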

87

Recessive model example

88

In-class exercise
Part 1 of 2

89

In-class Exercises

1) Create vectors a=(5, 6, 7) and b=(10, 3, 1) and obtain the total sum of their elements that are greater than 5

2) Calculate

3) Create matrix A and replace the element located in the 3rd row and 3rd column by the sum of the 1st and 2nd row
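For exercise 1, one possible approach uses logical indexing: a comparison such as `a > 5` yields a TRUE/FALSE vector that selects the matching elements. This is a sketch of one solution, not necessarily the one intended in class.

```r
a <- c(5, 6, 7)
b <- c(10, 3, 1)

# Keep only the elements greater than 5 in each vector, then add them all up
total <- sum(a[a > 5], b[b > 5])
print(total)
```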

90

In-class Exercises

4) Write a function which takes a single argument which is a matrix A (see Q3). The function should return a matrix which is the same as the function argument but every odd number is doubled.
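A possible sketch for exercise 4, again based on logical indexing. The matrix from Q3 is not reproduced in this text, so the 3x3 matrix `A` below is a placeholder assumption, and the function name `double_odds` is hypothetical.

```r
# Return a copy of A in which every odd entry is doubled
double_odds <- function(A) {
  odd <- A %% 2 == 1      # TRUE where the entry is odd
  A[odd] <- 2 * A[odd]    # replace only those entries
  A
}

A <- matrix(1:9, nrow = 3)  # placeholder matrix (assumption)
result <- double_odds(A)
print(result)
```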

91

Plots generation in R

• R provides a very rich set of plotting possibilities

• The basic command is plot()

• Many libraries provide their own version of the plot() function

• When R plots graphics it opens a “graphical device” that can be either a window or a file
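A minimal sketch of the file-device idea, assuming a PDF file in a temporary location (the file name is generated by `tempfile()` and is only an example). In an interactive session a window device opens automatically when `plot()` is called. To write to a file instead, a file device is opened first and closed with `dev.off()` when the plot is finished.

```r
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)   # open a file device
plot(1:10, (1:10)^2, main = "Written to a PDF device")
dev.off()                              # close the device so the file is complete

print(file.exists(out_file))
```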

92

Plotting functions

• R offers the following array of plotting functions

Function      Description
plot(x)       plot of the values of x on the y axis (against their index)
plot(x,y)     bivariate plot of x and y values (both axes scaled based on the values of x and y)
pie(y)        circular pie chart
boxplot(x)    box plot showing variables via their quantiles
hist(x)       histogram (bar plot of binned values)
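The functions in the table can be tried on simulated data. The vectors `x`, `y` and `shares` below are made up for illustration. When the code is run as a script, the first line sends the graphics to a null device instead of a window.

```r
if (!interactive()) pdf(NULL)  # scripts: draw to a null device

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
shares <- c(A = 10, B = 30, C = 60)

plot(x)            # values of x against their index
plot(x, y)         # scatter plot of two variables
pie(shares)        # pie chart of three categories
b <- boxplot(x)    # box plot; the five summary statistics are returned invisibly
h <- hist(x)       # histogram; bin counts are returned invisibly
```

Both `boxplot()` and `hist()` return the numbers behind the picture, which is handy when the summary itself is needed.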

93

Plot modification functions

• Often R plots are not optimal at first

• R has an array of graphical parameters
– Consult ?par for the full list

• Some of the graphical parameters can be specified inside plot() or using other graphical functions such as lines()

94

Plot modification functions

Function                                Description
points(x,y)                             adds points to the plot using coordinates specified in x and y vectors
lines(x,y)                              adds a line using coordinates in x and y
mtext(text,side=3)                      adds text to a given margin specified by side number
hist(x)                                 draws a histogram that bins values of x into categories represented as bars
arrows(x0,y0,x1,y1, angle=30, code=1)   adds an arrow to the plot from (x0, y0) to (x1, y1); angle sets the angle between the shaft and the edges of the arrow head, and code specifies at which end the arrow head is drawn
abline(h=y)                             draws a horizontal line at the y coordinate
rect(x1, y1, x2, y2)                    draws a rectangle with corners at (x1, y1) and (x2, y2)
legend(x, y, legend)                    adds a legend to the plot at the position specified by the x and y coordinates
title()                                 adds a title to the plot
axis(side, vect)                        adds an axis on the chosen one of the 4 sides; vect specifies where tick marks are drawn
locator()                               used interactively to select locations with the mouse
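A short sketch combining several of these functions on one plot. The data and coordinates are arbitrary choices for illustration. All of these calls add to an existing plot, so `plot()` has to come first.

```r
if (!interactive()) pdf(NULL)  # scripts: draw to a null device

x <- 1:10
y <- x^2

plot(x, y, type = "n")                       # set up the axes without drawing data
points(x, y, pch = 19)                       # add the data points
lines(x, y, lty = 2)                         # join them with a dashed line
abline(h = 50, col = "red")                  # horizontal reference line
arrows(2, 80, 7, 50, angle = 30, code = 2)   # arrow head at the (7, 50) end
rect(1, 0, 3, 10, border = "blue")           # rectangle from (1, 0) to (3, 10)
mtext("text in the top margin", side = 3)
title(main = "Annotated plot")
legend("topleft", legend = c("data", "threshold"),
       lty = c(2, 1), col = c("black", "red"))

usr <- par("usr")  # plotting region limits: x1, x2, y1, y2
```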

95

Plot margins

• Outer margin
– par(oma=c(3, 2, 2, 1))

• Fig. margin
– par(mar=c(5, 4, 4, 2))

• Note the num. order
– bottom, left, top, right
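A small sketch of setting both margins and restoring the previous settings afterwards. `par()` returns the old values of the parameters it changes, which makes the restore step easy. The margin labels are made-up text.

```r
if (!interactive()) pdf(NULL)  # scripts: draw to a null device

# Order of the four numbers: bottom, left, top, right (in lines of text)
old <- par(oma = c(3, 2, 2, 1), mar = c(5, 4, 4, 2))

plot(1:10)
mtext("outer margin, top", side = 3, outer = TRUE)
mtext("figure margin, bottom", side = 1, line = 3)

current_mar <- par("mar")  # the figure margins now in effect
par(old)                   # restore the previous margins
```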

96

Visualization of data in R
cars <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)
g_range <- range(0, cars, trucks)

plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)

lines(trucks, type="o", pch=22, lty=2, col="red")
axis(2, las=1, at=4*0:g_range[2])
axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))
box()

title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Total", col.lab=rgb(0,0.5,0))

97

Visualization of data in R
c1 <- c(1,3,6,4,9)
c2 <- c(2,5,4,5,12)
c3 <- c(4,4,6,6,16)
autos_data <- cbind(c1,c2,c3)
colnames(autos_data) <- c("cars", "trucks", "suvs")

barplot(as.matrix(autos_data), main="Autos", ylab="Total", beside=TRUE, col=rainbow(5))
legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.8, bty="n", fill=rainbow(5))

98

Visualization of data in R

# Expand right side of the margin for the legend
par(xpd=TRUE, mar=par()$mar+c(0,0,0,4))

# Graph autos using heat colors,
# put 10% of the space between each bar, and make
# labels smaller with horizontal y-axis labels
barplot(t(autos_data), main="Autos", ylab="Total", col=heat.colors(3), space=0.1, cex.axis=0.8, las=1, names.arg=c("Mon","Tue","Wed","Thu","Fri"), cex=0.8)

legend(6, 30, colnames(autos_data), cex=0.8, fill=heat.colors(3))

# Restore default margins
par(mar=c(5, 4, 4, 2) + 0.1)

99

Visualization of data in R

dotchart(t(autos_data), color=c("red","blue","darkgreen"), main="Dotchart for Autos", cex=0.8)

100

Visualization of data in R

r <- rlnorm(1000); hist(r)

101

Visualization of data in R
r <- rlnorm(1000)
# Get the distribution
h <- hist(r, plot=F, breaks=c(seq(0,max(r)+1, .1)))
# Plot the distribution using log scale on both axes, and
# use blue points
plot(h$counts[h$counts > 0], log="xy", pch=20, col="blue", main="Log-normal distribution", xlab="Value", ylab="Frequency")
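In the call above, the counts are plotted against their position among the non-empty bins, not against the bin values. A possible variant, sketched below, uses the bin midpoints stored in `h$mids` as x coordinates, so the axis labelled "Value" shows actual values. The seed and the variable `keep` are additions for this example.

```r
if (!interactive()) pdf(NULL)  # scripts: draw to a null device

set.seed(42)
r <- rlnorm(1000)

# Bin the data without drawing; the breaks cover 0 .. max(r) in steps of 0.1
h <- hist(r, plot = FALSE, breaks = seq(0, max(r) + 1, by = 0.1))

keep <- h$counts > 0  # empty bins cannot be shown on a log axis

plot(h$mids[keep], h$counts[keep], log = "xy", pch = 20, col = "blue",
     main = "Log-normal distribution",
     xlab = "Value (bin midpoint)", ylab = "Frequency")
```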

102

In-class exercises
Part 2 of 2

103

In-class exercises

1) Create two perfectly correlated vectors (with ρ=1) and plot them.

2) Change data points to a dashed line

3) Add horizontal and vertical red lines to your plot at coordinates: a) (1,10) and (4,4); b) (7,7) and (1,20)

4) Add text to the plot with text() to the left and right of the diagonal line

5) Add a title to your plot

6) Add a legend to the plot

104

Installation of new libraries

• There are two main R repositories
– CRAN
– BioConductor

• To install a package/library from CRAN
install.packages("seqinr")

• To install packages from BioConductor
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
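The `biocLite()` script shown here is the 2015 installation route. In current Bioconductor releases it has been replaced by the BiocManager package from CRAN. A minimal sketch of the newer route follows. The wrapper `install_bioc()` is a hypothetical helper written for this example, while `BiocManager::install()` is the actual installer function.

```r
# Hypothetical helper (assumption): install Bioconductor packages via BiocManager
install_bioc <- function(pkgs) {
  # BiocManager itself is a regular CRAN package
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Usage (needs network access, so it is left commented out here):
# install_bioc("GenomicRanges")
```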

105

The Comprehensive R Archive Network (CRAN)

• CRAN
– package repository features 7154 available packages
– install.packages("packageName")
– Some popular packages
  • ggplot2 – beautiful plots
  • Matrix – sparse matrices
  • igraph – graphs and analysis thereof

106

source("http://bioconductor.org/biocLite.R")
biocLite("packageName")

• Repository of biology-related libraries in R
– 178,856 packages

• Some of the libraries
– biomaRt
– IRanges

107

Installation of the new libraries

• Download and install the latest R version on your PC. Go to http://cran.r-project.org/

• Install the following libraries by running
install.packages(c("seqinr", "ape", "GenABEL"))

source("http://bioconductor.org/biocLite.R")
biocLite(c("limma", "muscle", "affy", "hgu133plus2.db", "Biostrings"))

108

Biological annotation with biomaRt

• Biological experiments require ID conversions
– e.g. microarray probe id to gene name
– e.g. mapping of SNP to gene symbol

• BioMart online service
– can be accessed via web-GUI
– programmatically via biomaRt

109

BioMart

• Go to http://central.biomart.org/converter/#!/ID_converter/
• Select genome assembly version
• Paste or upload the ID list
• E.g. convert SOX1 to GOID

110

biomaRt

• Access BioMart programmatically
– install the biomaRt library
source("http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")
require(biomaRt)

– use listMarts() to see the different databases
listMarts()

– we will use the ensembl mart
ensMart<-useMart("ensembl")

111

biomaRt

- Use listDatasets() to see which data sets are available in the database
listDatasets(ensMart)

- We will use the 'homo sapiens' dataset
ensembl_hs_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

- List the available attributes (db fields)
listAttributes(ensembl_hs_mart)[1:100,]

- Download data
ensembl_df <- getBM( attributes=c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol", "chromosome_name", "entrezgene"), mart=ensembl_hs_mart )

112

biomaRt

- List of genes to annotate
my_genes = c("ENSG00000197971", "ENSG00000153165", "ENSG00000159352", "ENSG00000146006", "ENSG00000149809", "ENSG00000204179", "ENSG00000213023", "ENSG00000115008", "ENSG00000130844", "ENSG00000155363")

- Annotate IDs
my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),]

       ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name entrezgene
66435  ENSG00000197971       ENST00000382582         MBP              18       4155
66038  ENSG00000153165       ENST00000409886       RGPD3               2     653489
190545 ENSG00000159352       ENST00000368884       PSMD4               1       5710
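The annotation step relies on `match()`, which returns, for each query ID, the row position of its first occurrence in the table (or NA if it is absent). The sketch below reproduces this offline with a toy table built from the three rows printed above. The ID "ENSG_NOT_IN_TABLE" is a made-up placeholder used to show the NA case.

```r
# Toy annotation table (three rows taken from the output above)
ensembl_df <- data.frame(
  ensembl_gene_id = c("ENSG00000197971", "ENSG00000153165", "ENSG00000159352"),
  hgnc_symbol     = c("MBP", "RGPD3", "PSMD4"),
  chromosome_name = c("18", "2", "1"),
  stringsAsFactors = FALSE
)

# Query in a different order, plus one placeholder ID missing from the table
my_genes <- c("ENSG00000159352", "ENSG_NOT_IN_TABLE", "ENSG00000197971")

idx <- match(my_genes, ensembl_df$ensembl_gene_id)  # row positions, NA if absent
my_genes_ann <- ensembl_df[idx, ]                   # rows in the order of my_genes

print(idx)
print(my_genes_ann)
```

Because only the first match is used, a gene that appears on several rows (one per transcript) is annotated with a single row. `merge()` would be an alternative when all rows are wanted.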

113

In-class exercise

1) Map SNP rs2066844 to a gene with the biomaRt library. What are its gene symbol, full gene name and biological function?

114

References

[1] R Core Team. "R Language Definition." (2000).
[2] Durinck, Steffen, et al. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21.16 (2005): 3439-3440.

115

116

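In-class exercises (answers), s. 103
x=seq(1,10)
y=2*seq(1,10)
# plot() must be called first; lines(), text() and legend() add to an existing plot
plot(x,y, type="c")
lines(c(1,10),c(4,4), col="red")
lines(c(7,7),c(1,20), col="red")
text(5, 8, "This is my plot", adj = c(0,0))
legend(9,15, "data" , lty=1, col=c('red'), cex=0.8)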

117

In-class exercises (answers), s. 113
library(biomaRt)
snp_db=useMart("snp", dataset="hsapiens_snp")

getBM(attributes="associated_gene", filters = "snp_filter", values = "rs2066844", mart=snp_db)