Statistical lingua franca Introduction to R Lecture 2 22 Sept 2015 Kyrylo Bessonov 1


1

Statistical lingua franca

Introduction to R

Lecture 2
22 Sept 2015

Kyrylo Bessonov

2

Outline

1. Introduction to R language
   1. basics
   2. loops
   3. data structures
2. Visualization
   1. rich plotting options
3. Bioinformatics
   1. repositories
   2. annotation of biological IDs

3

Definition

• “R is a free software environment for statistical computing and graphics”1

• R is considered to be one of the most widely used languages amongst statisticians, data miners, bioinformaticians and others.

• R is a free implementation of the S language
• Commercial statistical packages include SPSS, SAS and MATLAB

1 R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/)

4

History

• 1993 creation
– by Ross Ihaka and Robert Gentleman at the University of Auckland
• Community extended via
– Packages
• Facts
– initial interpreter ~ 1000 lines of C code
– GNU project

5

R language features

• Based on S language (commercial)
– Created in 1976 by Bell Labs
– Main motto: "to turn ideas into software, quickly and faithfully" (John Chambers)
• Memory allocation
– Fixed allocation at startup
– ‘on-the-fly’ garbage collector
– memory hungry
• Variables have lexical scoping
– functions have access to the variables that were in effect when the function was defined

6

Lexical scoping

make.counter <- function(x = 0){
  function(){
    x <<- x + 1
    cat("Inside the make.counter() the x is",x,"and not 5!\n")
    print(environment())
  }
}
x <- 5
cat("Global value of x is", x , "\n"); print(environment())
counter <- make.counter()
counter()
print(x)

1) The value of x inside make.counter() is kept after the function call and belongs to the function environment (i.e. scope)
2) The global value of x is not modified
3) The “inner function” (i.e. function()) reads and updates the local value of x, not the global one

More details are given in the functions section of the tutorial…

The value of a variable is first searched for in the function's own environment. If it is not found, the search continues in the parent environment, i.e. the environment in which the function was defined

7

R language features
• Flexible Plot Layouts
– The layout command divides up the plotting area into a matrix.
nf <- layout(matrix(c(1,1,0,2), 2, 2, byrow=TRUE), respect=TRUE)
– Calling layout.show(nf) displays the layout for your reference.

8

R language features

• A flexible statistical analysis toolkit
– Statistical models
• Regression, ANOVA, GLM, trees
• Free data analysis
– No subscription fees
– Transparent code
• Is a language
– Can write complex scripts
– Not limited by the GUI

• Active community

9

Why to learn R?

• R is widely used by bioinformaticians and statisticians

• It is multiplatform – macOS (Mac OS X), Windows, Linux
• Large number of libraries
• Main library repositories: CRAN and BioConductor

10

In a nut-shell…

• R is a scripting language and, as such, is much easier to learn than compiled languages such as C

• R has reasonably well written documentation (vignettes)

• Syntax in R is simple and intuitive if one has basic statistics skills

• R scripts will be provided and explained in-class

11

Installation of R

1. go to http://www.r-project.org/
2. select CRAN (Comprehensive R Archive Network) from left menu
3. link to nearby geographic site
4. select your operating system
5. choose "Base" installation
6. save R-X.X.X-win32.exe (windows) or R-X.X.X-mini.dmg (Mac OS X)
7. run the installation program accepting defaults

12

Tutorial
Introduction to R language

Part 1: Basics

13

Topics covered in this tutorial

• Operators / Variables
• Main objects types
• Visualization
– plot modification functions
• Writing and reading data to/from files
• Bioinformatics applications
– Annotation of array probes

14

Variables/Operators

• Variables store one element
x <- 25
Here the variable x is assigned the value 25
• Check the value assigned to the variable x
> x
[1] 25
• Basic mathematical operators:
– arithmetic: + , - , * , /
– power: ^
• Use parentheses to obtain the desired sequence of mathematical operations

15

Arithmetic operators

• What is the value of lowercase z here?
> x <- 25
> y <- 15
> z <- (x + y)*2
> Z <- z*z
> z
[1] 80

16

Logical operators

• These operators mostly work on vectors, matrices and other data types

• Type of data is not important, the same operators are used for numeric and character data types

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x | y      x OR y
x & y      x AND y

17

Logical operators

• Can be applied to vectors in the following way. Each element of the result is either TRUE or FALSE
> v1
[1] 48 2 3 4 5
> v1 <= 3
[1] FALSE TRUE TRUE FALSE FALSE
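A short sketch of how such comparisons are typically used: the resulting logical vector can index the original vector, and conditions can be combined with & and |. The vector v1 is the one from the slide; the names small and outside are illustrative only.

```r
v1 <- c(48, 2, 3, 4, 5)

small <- v1 <= 3                # FALSE TRUE TRUE FALSE FALSE
v1[small]                       # keep only elements where the test is TRUE: 2 3

outside <- v1 < 3 | v1 > 40     # element-wise OR: TRUE TRUE FALSE FALSE FALSE
sum(outside)                    # TRUE counts as 1, so this counts matches: 2

which(v1 >= 4 & v1 != 48)       # positions satisfying both tests: 4 5
```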

18

Key functions

• mathematical functions
– sqrt, log, exp, sin, cos, tan
• simple functions
– max, min, length, sum, mean, var, sort
• other useful functions
– abs(-5) # absolute value
– exp(8) # exponentiation
– log(exp(8)) # natural logarithm
– sqrt(64) # square root

19

R workspace

• Display all workspace objects (variables, vectors, etc.) via ls():
> ls()
[1] "Z" "v1" "x" "y" "z"
• Useful tip: to save the workspace to a file and restore it later use:
> save.image(file = "workplace.rda")
> load(file = "workplace.rda")

20

How to find help info?
• Any function in R has help information
• To invoke help use the ? sign or help():
?function_name
?mean
help(mean, try.all.packages=T)
• To search in all packages installed in your R installation always use try.all.packages=T in help()

• To search for a key word in R documentation use help.search():

help.search("mean")

21

Basic data types

• Data could be of 3 basic data types:
– numeric
– character
– logical
• Numeric variable type / class:
> x <- 1
> mode(x)
[1] "numeric"
> class(x)
[1] "numeric"

22

Basic data types

• Logical variable type (TRUE/FALSE):
> y <- 3<4
> mode(y)
[1] "logical"
• Character variable type:
> z <- "Hello class"
> mode(z)
[1] "character"

23

Objects/Data structures
• The main data objects in R are:
– Matrices (single data type)
– Data frames (supports various data types)
– Lists (contain set of vectors)
– Other more complex objects with slots
• S3 and S4 objects
• Matrices are 2D objects (rows/columns)
> m <- matrix(0,2,3)
> m
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

24

Vectors c()

• Vectors have only 1 dimension and represent an enumerated sequence of data. They can also store variables
> v1 <- c(1, 2, 3, 4, 5)
> mean(v1)
[1] 3
The elements of a vector are accessed/modified with square brackets (e.g. [number])
> v1[1] <- 48
> v1
[1] 48 2 3 4 5

25

Lists

• Lists contain various vectors. Each vector in the list can be accessed by double square brackets [[number]]. The elements of a list can be of different types
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 4)
> L1 <- list(x, y)
> L1
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 4

26

Matrices

• Contain data of the same type
– i.e. all cells are one of the 3 basic types
• Defined with matrix()
– Arguments:
• initialization value (e.g. 0)
• ncol: number of columns
• nrow: number of rows
matrix(0,ncol=2,nrow=2)
     [,1] [,2]
[1,]    0    0
[2,]    0    0

27

Data frames
• Data frames are similar to matrices but can contain various data types
> x <- c(1,5,10)
> y <- c("A", "B", "C")
> z <- data.frame(x,y)
> z
   x y
1  1 A
2  5 B
3 10 C

• To get/change column and row names use colnames() and rownames()

28

Factors

• Factors are special
– could be vectors of integers or characters
– can identify levels
• i.e. unique values in a vector
> letters = c("A","B","C","A","C","C")
- the vector letters has 6 elements but 3 levels
> letters = factor(letters)
> letters
[1] A B C A C C
Levels: A B C
> summary(letters)
A B C
2 1 3
- summary() shows
- the number of values per level
- the unique values (levels)

29

Conversion between data types

• One can convert one type of data into another using as.xxx where xxx is a data type
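A brief sketch of the as.xxx family in use. The input values are made up for illustration; the factor case is included because converting a factor directly returns its internal level codes rather than its labels.

```r
x <- c("1", "2.5", "abc")
n <- suppressWarnings(as.numeric(x))   # "abc" cannot be parsed and becomes NA
n                                      # 1.0 2.5 NA

as.character(c(TRUE, FALSE))           # "TRUE" "FALSE"
as.logical(c(0, 1, 2))                 # FALSE TRUE TRUE (any non-zero number is TRUE)

f <- factor(c("10", "20", "10"))
as.numeric(f)                          # 1 2 1   (internal level codes)
as.numeric(as.character(f))            # 10 20 10 (the labels, usually what is wanted)
```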

30

S4 objects
• Allow storing more complex data structures
• Benefits
– Provide greater modularity
– Check for data type errors during assignment
– Cleaner code
• Every S4 object belongs to a class
require(methods)
setClass("Box",
  slots=c(box_name = "character", address = "numeric", files = "list"),
  prototype=list(box_name=character(0), address=numeric(0), files=list())
)

31

S4 objects

• Each object is an instance of a class
object = new("Box")
• To access slots in the object use the @ operator
object@box_name
object@address
object@files
• Assign values to slots
object@box_name="pretty_box"

32

S4 object functions

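• Use functions to get and set values in objects
# a generic must be declared before a method can be attached to it
setGeneric("getBoxName", function(object) standardGeneric("getBoxName"))
setMethod(f="getBoxName", signature="Box", definition=function(object){
  return(object@box_name)
})
setGeneric("setBoxName", function(object, name) standardGeneric("setBoxName"))
setMethod(f="setBoxName", signature="Box", definition=function(object,name){
  object@box_name <- name
  return(object) # return the modified copy
})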
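The pieces above can be put together as a self-contained sketch. It assumes the Box class from the slides, declares the generics with setGeneric() before attaching methods, and shows that a setter has to return the object because R passes arguments by value. The values "plain_box" and 1 are arbitrary.

```r
library(methods)

setClass("Box",
         slots = c(box_name = "character", address = "numeric", files = "list"))

# Generics must exist before methods can be registered for them
setGeneric("getBoxName", function(object) standardGeneric("getBoxName"))
setGeneric("setBoxName", function(object, name) standardGeneric("setBoxName"))

setMethod("getBoxName", "Box", function(object) object@box_name)
setMethod("setBoxName", "Box", function(object, name) {
  object@box_name <- name
  object                          # return the modified copy
})

b <- new("Box", box_name = "pretty_box", address = 1)
b <- setBoxName(b, "plain_box")   # reassign: the original object is not changed in place
getBoxName(b)                     # "plain_box"

# Slot types are checked on assignment: a character value is rejected for a numeric slot
res <- tryCatch({ b@address <- "not a number"; "no error" },
                error = function(e) "type error")
res                               # "type error"
```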

33

Read Input/ Write Output

• To read data into R from a text file use read.table()
– read help(read.table) to learn more
– scan() is a more flexible alternative
raw_data <- read.table(file="data_file.txt")
• To write data from R to a text file use write.table()
> write.table(mydata, "data_file.txt")
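As a minimal round-trip sketch: a small data frame is written to a temporary file and read back. The data frame contents and the use of tempfile() are assumptions made for illustration; the sep, quote and row.names settings are one reasonable choice, not the only one.

```r
mydata <- data.frame(id = 1:3, label = c("A", "B", "C"))
path <- tempfile(fileext = ".txt")   # temporary file used only for this example

# Tab-separated, no quotes around strings, no row-name column
write.table(mydata, path, sep = "\t", quote = FALSE, row.names = FALSE)

# header = TRUE tells read.table() that the first line holds column names
raw_data <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

raw_data$label                       # "A" "B" "C"
nrow(raw_data)                       # 3
```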

34

Loops

• Loops allow repeating operations
– can check a given condition
– execute n number of times
• R is not specifically designed for loops
– use them sparingly or replace them by apply()
– avoid nested loops
• Will look at
– for loop
– while loop
– repeat loop

35

For loop
• Executed n number of times
– safe to use
– variable i will take values found in the array (i.e. 1:10)
for(i in 1:10){
  print(i)
}

36

while loop

• Repeats the loop body for as long as the condition is TRUE; stops when it becomes FALSE
i=1;
while(i<=10){
  print(i)
  i=i+1
}

37

repeat
• similar to the while loop
• will always begin the loop
– Executed at least once
• Important to have a breaking condition
– otherwise, infinite loop
x <- 1
repeat {
  x <- x+1
  if(x >= 5) break;
}
print(x)

38

apply()

• A compact alternative to loops
• Apply a function to 2D data structures (matrix, df)
apply(X, MARGIN, FUN, ...)
X: a matrix or data frame (df)
MARGIN: 1 = rows, 2 = columns
FUN: function to apply to rows or columns
M <- matrix(1:6, nrow=3, byrow=TRUE)
apply(M, 1, sum) # row sums
apply(M, 2, sum) # column sums
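To make the meaning of MARGIN concrete, here is a small sketch using the same matrix M. The names row_sums and col_sums are illustrative; rowSums() and colSums() are base R helpers that give the same totals.

```r
M <- matrix(1:6, nrow = 3, byrow = TRUE)
# M is
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

row_sums <- apply(M, 1, sum)   # MARGIN = 1: one result per row    -> 3 7 11
col_sums <- apply(M, 2, sum)   # MARGIN = 2: one result per column -> 9 12

# The dedicated helpers return the same totals
all(row_sums == rowSums(M))    # TRUE
all(col_sums == colSums(M))    # TRUE
```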

39

lapply()

• is applied on lists / vectors
• it does not work on matrices or higher-dimensional arrays directly
• returns list()
M <- matrix(1:6, nrow=3, byrow=TRUE)
lapply(1:2,function(x){sum(M[,x])})
L=list(sample(1:10,10))
lapply(L,mean)

40

sapply()

• the "s" in "sapply" stands for "simplify"
• simplifies output to a vector
M <- matrix(1:6, nrow=3, byrow=TRUE)
sapply(1:2,function(x){sum(M[,x])})
[1] 9 12

41

tapply()
• Apply a function to subsets of a variable defined by a grouping factor
tapply(summary, group, Function)
Summary variable: selected variable to apply function on
Group variable: the filter variable used to split data
Function: function to apply to resulting sets
medical.data <- data.frame(patient = 1:100, age = rnorm(100, mean = 60, sd = 12), treatment = gl(2, 50, labels = c("Treatment", "Control")))
tapply(medical.data$age, medical.data$treatment, mean)
Treatment   Control
 61.10559  59.37849

42

Conditional statements

• allow making choices
– specific parts of the code executed
• Add flexibility
• Main statements
– if … else
– switch

43

If..else

if (test expression) {
  #code
} else if (test expression) {
  #code
} else {
  #code
}
• else does not have any test expression
– it handles all cases not covered by if() or else if()
• test expression returns a logical value (T or F)

44

If..else
sign <- function(x){
  if(x > 0){
    print("x is a positive number");
  } else if (x == 0){
    print("x is zero");
  } else{
    print("x is negative")
  }
}
x<- 5; sign(x);
x<- 0; sign(x);
x<- -2; sign(x);

45

switch()

switch (statement, list)
• the statement is evaluated and its value returned
• based on this value, the corresponding item in the list is returned
y <- rnorm(5)
x <- "sd"
z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y))
print(z)
sd(y)
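Two features of switch() are worth a short sketch: an empty entry falls through to the next non-empty one, and a final unnamed entry acts as the default. The helper name summarise and the data values are hypothetical, chosen so the results are easy to check by hand.

```r
# Hypothetical helper: pick a summary statistic by name
summarise <- function(y, stat) {
  switch(stat,
         mean = mean(y),
         median = median(y),
         variance = ,                           # empty: falls through to the next entry
         var = var(y),
         sd = sd(y),
         stop("unknown statistic: ", stat))     # unnamed last entry is the default
}

y <- c(1, 2, 3, 4, 10)
summarise(y, "median")     # 3
summarise(y, "variance")   # 12.5, same as summarise(y, "var")
```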

46

Tutorial
Introduction to R language

Part 2: Functions

47

Functions in R

• Functions encapsulate chunks of code
– Helps with debugging and code readability
– Variables passed by ‘pass by value’
add.function <- function(x){
  x+5
}
> add.function(2)
[1] 7

48

Functions in R

• defined with the function() directive
• stored as R objects
• can be nested
• The return() directive is the last expression in the function body to be evaluated
f <- function() {
  # Do something interesting
  return(1)
}

49

Arguments

• named arguments
– optionally could have default values
• not every function call in R makes use of all arguments
– i.e. function arguments could be missing
• then the default values are used
mydata <- sample(1:10,20, replace=T)
sort(mydata)
sort(mydata, decreasing=T)

50

Arguments

• can be matched by
– relative sequential position: f(arg1, arg2, arg3)
rnorm(100,0,1)
– name: f(data=arg1, replace=arg2)
rnorm(n=100, mean = 0, sd = 1)
– can mix positional and by name matching
mydata=data.frame(y=rnorm(100),x=rnorm(100))
lm(data = mydata, y ~ x, model = FALSE, 1:100)
#equivalently
lm(y ~ x, mydata, 1:100, model = FALSE)

51

Lazy evaluation

• Arguments to functions are evaluated lazily
– i.e. R does not check the # of arguments supplied to function
– i.e. can give less but not more args to a function
– args are evaluated as needed
f <- function(a,b) {
  a^2
}
f(a=10) #no error even if b is missing
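The following sketch illustrates the consequences of lazy evaluation a little further. The function names f, g and h are illustrative: f uses missing() to detect an omitted argument, g never touches its second argument, and h fails only at the point where the missing argument is actually needed.

```r
f <- function(a, b) {
  if (missing(b)) "b was not supplied" else a + b
}
f(10)                           # "b was not supplied"
f(10, 1)                        # 11

g <- function(a, b) a^2         # b is never evaluated
g(2, stop("never evaluated"))   # 4: the stop() call is not run

h <- function(a, b) a + b       # b is needed here
msg <- tryCatch(h(1), error = function(e) "error: b is missing")
msg                             # "error: b is missing"
```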

52

The … argument
• The ... argument indicates
– a variable # of arguments
– adds flexibility and abstraction
• Useful when the number of arguments passed to a function is not known a priori
x=1:10; y=rnorm(10)
f <- function(x, y, ...) {
  cat("value of x:",x,"\n");
  cat("value of y:",y,"\n");
  plot(x,y, ...)
}
f(x,y)
#adding extra argument type="l"
f(x,y, type="l")

53

Scoping

• Variable value is searched through a series of environments
– the global environment
– the namespaces of loaded libraries
search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
• The global environment
– is the user’s workspace

54

Scoping

• Loading a package via library() adds the namespace of that package to position 2 of the search list
library(igraph)
search()
 [1] ".GlobalEnv" "package:igraph" "package:stats"
 [4] "package:graphics" "package:grDevices" "package:utils"
 [7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"

55

Scoping

• The scoping rules for R
– different from the original S language
– uses lexical scoping
• related to how R “searches” for the variable value
– simplifies statistical calculations
f <- function(x, y) {
  y=2
  x+z
}
– z is a ‘free variable’ not defined in f()

56

Searching for the value

• free variable value will be searched
– in the environment in which the function was defined
– in the parent environment
– in the top-level environment (i.e. global)
– along the search list, down to the empty environment
• if value not found
– error thrown

57

Scoping
• Typically, the user defined values are found
– in the workspace (i.e. global environment)
– a function is defined in the global environment
• can have functions
– defined inside other functions
– the environment in which a function is defined could be the environment of another function
• child/parent environments
build.power <- function(n) {
  p <- function(x) {
    x^n
  }
  print(p)
}
square <- build.power(2)
cube <- build.power(3)
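As a sketch of what the closures created by build.power() contain: each returned function keeps its own defining environment, and the value of n lives there. This version returns the inner function directly instead of printing it; environment(), get() and ls() are base R functions used here only for inspection.

```r
build.power <- function(n) {
  function(x) x^n        # n is a free variable, found in the defining environment
}

square <- build.power(2)
cube   <- build.power(3)

square(4)                               # 16
cube(2)                                 # 8

# Each closure carries its own environment holding its own n
get("n", envir = environment(square))   # 2
get("n", envir = environment(cube))     # 3
ls(environment(cube))                   # "n"
```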

58

Lexical Scoping

• With lexical scoping
– the variable value is looked up in the environment in which the function was defined
y <- 10
f <- function(x) {
  y <- 2
  y^2
}
f()
[1] 4

59

Other languages

• Lexical scoping is supported in
– Perl
– Python
– …

60

Dynamic scoping

• With dynamic scoping
– The variable value is looked up in the environment from which the function was called (the calling environment)
y <- 10
f <- function(x) {
  y^2
}
f()
[1] 100
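The difference between the two rules only shows up when the calling environment and the defining environment disagree about y. The sketch below sets up that situation. R itself uses lexical scoping; the function f_dyn is an illustrative emulation of dynamic lookup via parent.frame(), not something R does by default.

```r
y <- 10
f <- function() y^2              # y is a free variable in f

g <- function() {
  y <- 2                         # local to g, invisible to f under lexical scoping
  f()
}
g()                              # 100: f finds y where f was defined

# Emulating dynamic scoping: look y up in the caller's frame instead
f_dyn <- function() {
  caller <- parent.frame()       # environment from which f_dyn was called
  get("y", envir = caller)^2
}
g_dyn <- function() {
  y <- 2
  f_dyn()
}
g_dyn()                          # 4: y is taken from the calling function
```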

61

Regression and Generalized Linear Models Functions

Examples

62

Regression
• Regression analysis is used to
– explain or model the relationship between
• a single variable Y, called the response, output or dependent variable, and
• one or more predictor, input, independent or explanatory variables, X1, …, Xp
– p=1 simple regression
– p>1 multiple regression

63

Main uses of regression analysis

• Prediction of future observations.
• Assessment of the effect of, or relationship between, explanatory variables on the response.

• A general description of data structure

64

Regression in R

• the basic syntax for doing regression in R
– lm(Y ~ model) to fit linear models
– glm(Y ~ model) to fit generalized linear models.
• where model could be specified following the general syntax rules in R
– consists of predictor terms
– provides Y~X relationship
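A quick way to see how R interprets a model formula is to ask terms() for the expanded term labels. The sketch below uses placeholder variable names (y, x1, x2) and a tiny made-up data frame for the dot shorthand; no model is fitted.

```r
# + adds terms
attr(terms(y ~ x1 + x2), "term.labels")        # "x1" "x2"

# * expands to main effects plus their interaction
attr(terms(y ~ x1 * x2), "term.labels")        # "x1" "x2" "x1:x2"

# I() protects arithmetic so that ^ means "power" rather than formula crossing
attr(terms(y ~ x1 + I(x1^2)), "term.labels")   # "x1" "I(x1^2)"

# - 1 removes the intercept
attr(terms(y ~ x1 - 1), "intercept")           # 0

# . stands for "all other columns of the data"
d <- data.frame(y = 1, a = 1, b = 1)           # illustrative data frame
attr(terms(y ~ ., data = d), "term.labels")    # "a" "b"
```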

65

Model syntax

66

Simple linear regression

• The regression of Y on X is given by
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \dots, n$
• Unknown parameters
– $\beta_0$: intercept
• point in which the line intercepts the y-axis
– $\beta_1$: slope or coefficient
• Increase in Y per unit change in X

67

Objective

• To find the equation of the line that “best” fits the data.
– find $\hat\beta_0$ and $\hat\beta_1$ such that the fitted values $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ are as close as possible to the observed $y_i$
– Residuals
• The difference between the observed value $y_i$ and the fitted value $\hat y_i$

68

residual sum of squares(RSS)

• A usual way of calculating $\hat\beta_0$ and $\hat\beta_1$
– based on the minimization of the sum of the squared residuals, or
• residual sum of squares (RSS)
$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
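For simple linear regression the RSS-minimizing intercept and slope have a closed form, which can be checked against lm(). This sketch uses R's built-in cars dataset (speed and stopping distance) rather than the lecture data, purely so that it runs without downloading anything; the names b0, b1 and rss are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

fit <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(fit)), c(b0, b1))    # TRUE: lm() finds the same line

# Residual sum of squares computed by hand and from the fitted model
rss <- sum((y - (b0 + b1 * x))^2)
all.equal(rss, sum(resid(fit)^2))          # TRUE
```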

69

Regression example

• thuesen dataset
– Ventricular shortening velocity
– 24 rows and 2 columns
• short.velocity (Y)
• blood.glucose (X)
– It contains ventricular shortening velocity and blood glucose for type 1 diabetic patients.

70

Regression example
file="http://www.montefiore.ulg.ac.be/~kbessonov/present_data/GBIO0009-1_TopInBioinf2015-16/lectures/L2/thuesen.txt"
data <- read.table(file, header=TRUE, stringsAsFactors=FALSE)
options(na.action =na.exclude)
fit.lm <- lm(short.velocity ~ blood.glucose, data=data)
data.frame(data, fitted.value=fitted(fit.lm), residual=resid(fit.lm))

  blood.glucose short.velocity fitted.value     residual
1          15.3           1.76     1.433841  0.326158532
2          10.8           1.34     1.335010  0.004989882
3           8.1           1.27     1.275711 -0.005711308
4          19.5           1.47     1.526084 -0.056084062
5           7.2           1.27     1.255945  0.014054962

71

Measuring Goodness of Fit

• Coefficient of Determination, R²
– Need to calculate TSS, RSS and SSreg
• The ANOVA breaks the total variability observed in the sample into two parts
– TSS = SSreg + RSS
– TSS: total sum of squares (entire sample variability)
– SSreg: regression sum of squares
• Explained by fitted model
– RSS: residual sum of squares (unexplained variability)

72

Calculating R2

anova(fit.lm)
Analysis of Variance Table

Response: short.velocity
              Df  Sum Sq  Mean Sq F value Pr(>F)
blood.glucose  1 0.20727 0.207269   4.414 0.0479 *
Residuals     21 0.98610 0.046957
---
R² = SSreg/TSS
R² = 0.20727/(0.98610+0.20727)
R² = 0.1736846
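The decomposition of the total sum of squares into a regression part and a residual part can also be verified numerically. The sketch below again uses the built-in cars dataset as a stand-in for the lecture data, and the variable names tss, rss, ssreg and r2 are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist

tss   <- sum((y - mean(y))^2)             # total variability
rss   <- sum(resid(fit)^2)                # unexplained by the model
ssreg <- sum((fitted(fit) - mean(y))^2)   # explained by the model

all.equal(tss, ssreg + rss)               # TRUE: the two parts add up

r2 <- ssreg / tss
all.equal(r2, summary(fit)$r.squared)     # TRUE: matches "Multiple R-squared"

# The same sums of squares appear in the "Sum Sq" column of anova(fit)
all.equal(anova(fit)[["Sum Sq"]], c(ssreg, rss))   # TRUE
```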

73

Getting lm() summary
summary(fit.lm)
Call:
lm(formula = short.velocity ~ blood.glucose, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40141 -0.14760 -0.02202  0.03001  0.43490

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.09781    0.11748   9.345 6.26e-09 ***
blood.glucose  0.02196    0.01045   2.101   0.0479 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2167 on 21 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.1737, Adjusted R-squared: 0.1343
F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479

* - shorthand for significance levels
Estimate - the estimated value of each coefficient (intercept and slope)
Std. Error - measure of the variability in the estimate of the coefficient β
t value - the estimate divided by its standard error; it is used to calculate the significance levels
R-squared - metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best.
DF - the residual degrees of freedom: the number of observations used in the fit (23, since 1 of the 24 rows has a missing value) minus the number of estimated coefficients (2: intercept and slope). DF = 23-2 = 21

74

Generalized Linear Models Functions

• The glm()
– is designed to perform generalized linear models regression on outcome data that is
• binary
• count
• proportion
• continuous
– overcomes limitation of classical regression models
• do not need to transform the response Y to have a normal distribution N(0, σ²)
– limitations
• assumes linear Y~X relationship

75

glm()

• glm(formula, family = "gaussian", data, …)
– formula: symbolic description of the model to fit
• e.g. y~x1+x2 or y~. or y~x1+x2+x1*x2
– family: type of distribution to apply to the response variable
– data: data frame with the variables
• glm returns an object with slots
• summary(obj) provides compact summary
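A small sketch of the family argument in action, using the built-in mtcars dataset as an assumed stand-in: with the default gaussian family glm() reproduces lm(), while the binomial family fits a logistic regression for a 0/1 response (here am, the transmission type, modelled by car weight wt).

```r
# family = gaussian (the default) gives the same coefficients as lm()
g_fit <- glm(mpg ~ wt, family = gaussian, data = mtcars)
l_fit <- lm(mpg ~ wt, data = mtcars)
all.equal(coef(g_fit), coef(l_fit))       # TRUE

# family = binomial fits a logistic regression to a binary response
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
coef(fit)                                 # the wt coefficient is negative

# type = "response" returns predicted probabilities instead of log-odds
p <- predict(fit, newdata = data.frame(wt = c(2, 4)), type = "response")
p                                         # lighter car has the higher probability of am = 1
```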

76

logistic regression example
Using data on admissions where rank of 1 represents the highest and 4 the lowest prestige. Determine the major variable/factor impacting the admission decision

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
head(mydata)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
fit <- glm(admit ~ gre + gpa + rank, family=binomial(link="logit"), data=mydata)
summary(fit)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.449548   1.132846  -3.045  0.00233 **
gre          0.002294   0.001092   2.101  0.03564 *
gpa          0.777014   0.327484   2.373  0.01766 *
rank        -0.560031   0.127137  -4.405 1.06e-05 ***
AIC: 467.44 (the lower the better)

77

Exploring glm() object
attributes(fit)
$names
 [1] "coefficients" "residuals" "fitted.values"
 [4] "effects" "R" "rank"
 [7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"

$class
[1] "glm" "lm"

78

Accessing glm S3 object values

• Can look at beta coefficients of each variable
fit$coefficients
• Can check how good the fitted model is
fit$deviance
fit$aic
• Use TAB key for auto-complete

79

Regression in Genetics

• Response yj for j=1..n
• where case(1) or control(0)
• Could refer to two types of patients or phenotypes
• Genotypes gi for markers i=1..p
• To run glm() need to code data
– Additive
– Dominant
– Recessive

80

Data

• There are genotypes for
– 1018 individuals at 32 SNP markers
– 32 columns give the marker genotypes
• AA-11, AG-12 and GG-22, with NA for missing
– The genotypes
• 890kb region flanking the CYP2D6 gene
• associated with the metabolism of drugs
file="http://www.well.ox.ac.uk/rmott/LECTURES/LOGISTIC_REGRESSION/ugeno.dat"
data <- read.table(file, header=TRUE, stringsAsFactors=FALSE)

81

Coding
• Additive coding
Code genotypes as
– AA(11) x=0,
– AG(12) x=1,
– GG(22) x=2
• Recessive coding
Code genotypes as
– AA x=0,
– AG x=0,
– GG x=1
• Dominant coding
Code genotypes as
– AA x=0,
– AG x=1,
– GG x=1

additive <- function( x ) {
  return(as.numeric(factor(x))-1)
}
recessive <- function( x ) {
  return ( ifelse( additive(x) > 1, 1, 0 ) )
}
dominant <- function( x ) {
  return ( ifelse( additive(x) > 0, 1, 0 ) )
}

These functions convert a genotype vector into the chosen coding
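One caveat with coding through factor(): the codes depend on which genotypes happen to be present in a marker, so a marker without any 11 genotypes would be shifted. The sketch below shows an alternative with an explicit lookup table. The function names additive2, recessive2 and dominant2 and the test vector g are hypothetical, introduced only for this illustration.

```r
# Explicit genotype -> allele count mapping, independent of which genotypes occur
additive2 <- function(x) {
  codes <- c("11" = 0, "12" = 1, "22" = 2)
  unname(codes[as.character(x)])          # NA genotypes stay NA
}
recessive2 <- function(x) as.numeric(additive2(x) == 2)
dominant2  <- function(x) as.numeric(additive2(x) >= 1)

g <- c(11, 12, 22, NA, 12)
additive2(g)     # 0 1 2 NA 1
recessive2(g)    # 0 0 1 NA 0
dominant2(g)     # 0 1 1 NA 1

# When a genotype class is absent, factor-based coding shifts the codes
no_aa <- c(12, 22, 22)
as.numeric(factor(no_aa)) - 1   # 0 1 1 (12 is coded as 0)
additive2(no_aa)                # 1 2 2
```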

82

Fitting the additive model

• Additive coding
x=apply(data[,-c(1:2)],2,additive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.add <- glm(y~m1, data=data_prep, family = 'binomial' )
Call: glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
    -2.8695      -0.9872

Degrees of Freedom: 1017 Total (i.e. Null); 1016 Residual
Null Deviance: 343.7
Residual Deviance: 335.1 AIC: 339.1

83

additive model example

84

Fitting the dominant model

• Dominant coding
x=apply(data[,-c(1:2)],2,dominant)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.dom <- glm(y~m1, data=data_prep, family = 'binomial' )
Call: glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
     -2.880       -1.004

Degrees of Freedom: 1017 Total (i.e. Null); 1016 Residual
Null Deviance: 343.7
Residual Deviance: 336.2 AIC: 340.2

85

Dominant model example

86

Fitting the recessive model

• Recessive coding
x=apply(data[,-c(1:2)],2,recessive)
data_prep=as.data.frame(cbind(y=data$y,x))
fit.recessive <- glm(y~m1, data=data_prep, family = 'binomial' )
Call: glm(formula = y ~ m1, family = "binomial", data = data_prep)

Coefficients:
(Intercept)           m1
     -3.126      -15.440

Degrees of Freedom: 1017 Total (i.e. Null); 1016 Residual
Null Deviance: 343.7
Residual Deviance: 340.1 AIC: 344.1

87

Recessive model example

88

In-class exercise
Part 1 of 2

89

In-class Exercises

1) Create vectors a=(5, 6, 7) and b=(10,3,1) and obtain a total sum of their elements that are greater than 5

2) Calculate
3) Create matrix A and replace the element located in the 3rd row and 3rd column by the sum of the 1st and 2nd row

90

In-class Exercises

4) Write a function which takes a single argument which is a matrix A (see Q3). The function should return a matrix which is the same as the function argument but every odd number is doubled.

91

Plots generation in R

• R provides a very rich set of plotting possibilities
• The basic command is plot()
• Each library has its own version of the plot() function
• When R plots graphics it opens a “graphical device” that could be either a window or a file

92

Plotting functions

• R offers the following array of plotting functions

Function     Description
plot(x)      plot of the values of x variable on the y axis
plot(x,y)    bi-variable plot of x and y values (both axes scaled based on values of x and y variables)
pie(y)       circular pie chart
boxplot(x)   plots a box plot showing variables via their quantiles
hist(x)      plots a histogram (bar plot)

93

Plot modification functions

• Often R plots are not optimal at first
• R has an array of graphical parameters; see help(par) for the full list
• Some of the graphical parameters can be specified inside plot() or using other graphical functions such as lines()

94

Plot modification functions

Function                                Description
points(x,y)                             adds points to the plot using coordinates specified in x and y vectors
lines(x,y)                              adds a line using coordinates in x and y
mtext(text,side=3)                      adds text to a given margin specified by side number
hist(x)                                 a histogram that bins values of x into categories represented as bars
arrows(x0,y0,x1,y1, angle=30, code=1)   adds an arrow to the plot specified by the x0, y0, x1, y1 coordinates. angle gives the angle of the arrow head and code specifies at which end the arrow head is drawn
abline(h=y)                             draws a horizontal line at the y coordinate
rect(x1, y1, x2, y2)                    draws a rectangle at the x1, y1, x2, y2 coordinates
legend(x,y)                             adds a legend to the plot at the position specified by x and y
title()                                 adds a title to the plot
axis(side, vect)                        adds an axis on the chosen one of the 4 sides; the vector specifies where tick marks are drawn
locator()                               used interactively to select locations with the mouse

95

Plot margins

• Outer margin
– par(oma=c(3, 2, 2, 1))
• Fig. margin
– par(mar=c(5, 4, 4, 2))
• Note the num. order
– bottom, left, top, right

96

Visualization of data in R
cars <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)
g_range <- range(0, cars, trucks)
plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
lines(trucks, type="o", pch=22, lty=2, col="red")
axis(2, las=1, at=4*0:g_range[2])
axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))
box()
title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Total", col.lab=rgb(0,0.5,0))

97

Visualization of data in R
c1 <- c(1,3,6,4,9); c2 <- c(2,5,4,5,12); c3 <- c(4,4,6,6,16);
autos_data <- cbind(c1,c2,c3);
colnames(autos_data)<-c("cars", "trucks", "suvs");
barplot(as.matrix(autos_data), main="Autos", ylab= "Total", beside=TRUE, col=rainbow(5))
legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.8, bty="n", fill=rainbow(5))

98

Visualization of data in R

#Expand right side of the margin for the legend
par(xpd=T, mar=par()$mar+c(0,0,0,4))
#Graph autos using heat colors
#put 10% of the space between each bar, and make
#labels smaller with horizontal y-axis labels
barplot(t(autos_data), main="Autos", ylab="Total", col=heat.colors(3), space=0.1, cex.axis=0.8, las=1, names.arg=c("Mon","Tue","Wed","Thu","Fri"), cex=0.8)
legend(6, 30, colnames(autos_data), cex=0.8, fill=heat.colors(3));
# Restore default margins
par(mar=c(5, 4, 4, 2) + 0.1);

99

Visualization of data in R

dotchart(t(autos_data), color=c("red","blue","darkgreen"), main="Dotchart for Autos", cex=0.8)

100

Visualization of data in R

r <- rlnorm(1000); hist(r)

101

Visualization of data in R
r <- rlnorm(1000)
# Get the distribution
h <- hist(r, plot=F, breaks=c(seq(0,max(r)+1, .1)))
# Plot the distribution using log scale on both axes, and
# use blue points
plot(h$counts[h$counts > 0], log="xy", pch=20, col="blue", main="Log-normal distribution", xlab="Value", ylab="Frequency")

102

In-class exercises
Part 2 of 2

103

In-class exercises

1) Create two perfectly correlated vectors (with ρ=1) and plot them.
2) Change data points to a dashed line
3) Add horizontal and vertical red lines to your plot at coordinates: a) (1,10) and (4,4); b) (7,7) and (1,20)
4) Add text to the plot with text() to the left and right of the diagonal line
5) Add title to your plot
6) Add legend to the plot

104

Installation of new libraries

• There are two main R repositories
– CRAN
– BioConductor
• To install package/library from CRAN
install.packages("seqinr")
• To install packages from BioConductor
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
(Bioconductor 3.8 and later replace biocLite() with BiocManager::install())

105

The Comprehensive R Archive Network (CRAN)

• CRAN
– package repository features 7154 available packages
– install.packages("packageName")
– Some popular packages
• ggplot2 – beautiful plots
• Matrix – sparse matrices
• igraph – graphs and analysis thereof

106

source("http://bioconductor.org/biocLite.R");biocLite("packageName")

• Repository of biology-related libraries in R
– 178,856 packages
• Some of the libraries
– biomaRt
– IRanges

107

Installation of the new libraries

• Download and install latest R version on your PC. Go to http://cran.r-project.org/

• Install the following libraries by running
install.packages(c("seqinr", "ape", "GenABEL"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("limma", "muscle", "affy", "hgu133plus2.db", "Biostrings"))

108

Biological annotation with biomaRt

• Biological experiments require ID conversions
– e.g. microarray probe id to gene name
– e.g. mapping of SNP to gene symbol
• BioMart online service
– can be accessed via web-GUI
– programmatically via biomaRt

109

BioMart
• Go to http://central.biomart.org/converter/#!/ID_converter/
• Select genome assembly version
• Paste or upload the ID list
• E.g. convert SOX1 to GOID

110

biomaRt

• Access BioMart programmatically
– install the biomaRt library
source("http://www.bioconductor.org/biocLite.R"); biocLite("biomaRt"); require(biomaRt);
- use listMarts() to see the different databases
listMarts()
- we will use the ensembl mart
ensMart <- useMart("ensembl")

111

biomaRt
- listDatasets() to see which data sets are available in the database
listDatasets(ensMart)
- Will use the 'homo sapiens' dataset
ensembl_hs_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
- List retrieved attributes (db fields)
listAttributes(ensembl_hs_mart)[1:100,]
- Download data
ensembl_df <- getBM(attributes=c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol", "chromosome_name", "entrezgene"), mart=ensembl_hs_mart)

112

biomaRt
- List of genes to annotate
my_genes = c("ENSG00000197971", "ENSG00000153165", "ENSG00000159352", "ENSG00000146006", "ENSG00000149809", "ENSG00000204179", "ENSG00000213023", "ENSG00000115008", "ENSG00000130844", "ENSG00000155363")
- Annotate IDs
my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),]

       ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name entrezgene
66435  ENSG00000197971       ENST00000382582         MBP              18       4155
66038  ENSG00000153165       ENST00000409886       RGPD3               2     653489
190545 ENSG00000159352       ENST00000368884       PSMD4               1       5710

113

In-class exercise

1) Map SNP rs2066844 to a gene with biomaRt library. What is its gene symbol, full gene name and gene biological function?

114

References

[1] R Core Team. "R Language Definition." (2000).
[2] Durinck, Steffen, et al. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21.16 (2005): 3439-3440.

115

116

In-class exercises (answers) s.103
x=seq(1,10)
y=2*seq(1,10)
plot(x,y, type="c") # the plot must exist before lines(), text() and legend() can add to it
lines(c(1,10),c(4,4), col="red")
lines(c(7,7),c(1,20), col="red")
text(5, 8, "This is my plot", adj = c(0,0))
legend(9,15, "data" , lty=1, col=c('red'), cex=0.8)

117

In-class exercises (answers). s113
library(biomaRt)
snp_db=useMart("snp", dataset="hsapiens_snp")
getBM(attributes="associated_gene", filters = "snp_filter", values = "rs2066844", mart=snp_db)
