Introduction to the R language Ahmed Rebai Bioinformatics and Comparative Genome Analysis

Preview:

Citation preview

Introduction to the R language

Ahmed Rebai

Bioinformatics and Comparative Genome Analysis

Websites• R www.r-project.org

– software; – documentation; – RNews.

• Bioconductor www.bioconductor.org– software, data, and documentation; – training materials from short courses;– www.bioconductor.org/workshops/UCSC03/uc

sc03.html– mailing list.

R language: Overview• Open source and open development.• Design and deployment of portable, extensible,

and scalable software.• Interoperability with other languages: C, XML.• Variety of statistical and numerical methods.• High quality visualization and graphics tools.• Effective, extensible user interface.• Innovative tools for producing documentation

and training materials: vignettes.• Supports the creation, testing, and distribution of

software and data modules: packages.

R user interface• Batch or command line processing

bash$ R to startR> q() to quit

• Graphics windows> X11()> postscript()> dev.off()

• File path is relative to working directory> getwd()> setwd()

• Load a package library with library()• GUIs, tcltk

Getting Helpo Details about a specific command whose name you know (input arguments, options, algorithm):

> ? t.test > help(t.test)

o See an example of usage:

> demo(graphics)> example(mean) mean> x <- c(0:10, 50) mean> xm <- mean(x) mean> c(xm, mean(x, trim = 0.1)) [1] 8.75 5.50

Getting Helpo HTML search engine lets you search for topics

with regular expressions:

> help.search

o Find commands containing a regular expression or object name:

> apropos("var") [1] "var.na" ".__M__varLabels:Biobase" [3] "varLabels" "var.test" [5] "varimax" "all.vars" [7] "var" "variable.names" [9] "variable.names.default" "variable.names.lm"

Getting Helpo Vignettes contain text and executable code:

> library(tkWidgets)> vExplorer()> openVignette()

Created using the Sweave() function..Rnw files produce a PDF file and a vignette.

o To see code for a function, type the name with no parentheses or arguments:

> plot

R as a Calculator

> log2(32)

[1] 5

> print(sqrt(2))

[1] 1.414214

> pi

[1] 3.141593

> seq(0, 5, length=6)

[1] 0 1 2 3 4 5

> 1+1:10

[1] 2 3 4 5 6 7 8 9 10 11

R as a Graphics Tool

> plot(sin(seq(0, 2*pi, length=100)))

0 20 40 60 80 100

-1.0

-0.5

0.0

0.5

1.0

Index

sin(seq(0, 2 * pi, length = 100))

> a <- 49> sqrt(a)[1] 7

> b <- "The dog ate my homework"> sub("dog","cat",b)[1] "The cat ate my homework"

> c <- (1+1==3)> c[1] FALSE> as.character(b)[1] "FALSE"

numeric

character string

logical

Variables

Missing Values

Variables of each data type (numeric, character, logical) can also take the value NA: not available. o NA is not the same as 0o NA is not the same as “”o NA is not the same as FALSEo NA is not the same as NULL

Operations that involve NA may or may not produce NA:

> NA==1[1] NA> 1+NA[1] NA> max(c(NA, 4, 7))[1] NA> max(c(NA, 4, 7), na.rm=T)[1] 7

> NA | TRUE[1] TRUE> NA & TRUE[1] NA

Vectorsvector: an ordered collection of data of the same type> a <- c(1,2,3)> a*2[1] 2 4 6

Example: the mean spot intensities of all 15488 spots on a microarray is a numeric vector

In R, a single number is the special case of a vector with 1 element.

Other vector types: character strings, logical

Matrices and Arrays

matrix: rectangular table of data of the same type

Example: the expression values for 10000 genes for 30 tissue biopsies is a numeric matrix with 10000 rows and 30 columns.

array: 3-,4-,..dimensional matrix

Example: the red and green foreground and background values for 20000 spots on 120 arrays is a 4 x 20000 x 120 (3D) array.

Listslist: ordered collection of data of arbitrary types.

Example:> doe <- list(name="john",age=28,married=F)> doe$name[1] "john“> doe$age[1] 28> doe[[3]][1] FALSE

Typically, vector elements are accessed by their index (an integer) and list elements by $name (a character string). But both types support both access methods. Slots are accessed by @name.

Data Frames

data frame: rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.Represents the typical data table that researchers come up with – like a spreadsheet.

Example:> a <-data.frame(localization,tumorsize,progress,row.names=patients)> a localization tumorsize progressXX348 proximal 6.3 FALSEXX234 distal 8.0 TRUEXX987 proximal 10.0 FALSE

What type is my data?class Class from which object inherits

(vector, matrix, function, logical, list, … )

mode Numeric, character, logical, …storage.mode

typeofMode used by R to store object (double, integer, character, logical, …)

is.function Logical (TRUE if function)is.na Logical (TRUE if missing)names Names associated with objectdimnames Names for each dim of arrayslotNames Names of slots of BioC objectsattributes Names, class, etc.

SubsettingIndividual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name

> a localization tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0

> a[3, 2][1] 10

> a["XX987", "tumorsize"][1] 10

> a["XX987",] localization tumorsize progressXX987 proximal 10 0

>a localization tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0

> a[c(1,3),] localization tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0

> a[-c(1,2),]localization tumorsize progressXX987 proximal 10.0 0

> a[c(T,F,T),] localization tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0

> a$localization[1] "proximal" "distal" "proximal"

> a$localization=="proximal"[1] TRUE FALSE TRUE

> a[ a$localization=="proximal", ] localization tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0

subset rows by a vector of indices

subset rows by a logical vector

subset columns

comparison resulting in logical vector

subset the selected rows

Example:

Functions and Operators

Functions do things with data“Input”: function arguments (0,1,2,…)“Output”: function result (exactly one)

Example:add <- function(a,b) {

result <- a+b return(result) }

Operators: Short-cut writing for frequently used functions of one or two arguments.

Frequently used operators

<- Assign

+ Sum

- Difference

* Multiplication

/ Division

^ Exponent

%% Mod

%*% Dot product

%/% Integer division

%in% Subset

| Or

& And

< Less

> Greater

<= Less or =

>= Greater or =

! Not

!= Not equal

== Is equal

Frequently used functions

c Concatenate

cbind,rbind

Concatenate vectors

min Minimum

max Maximum

length # values

dim # rows, cols

floor Max integer in

which TRUE indices

table Counts

summary Generic stats

Sort, order, rank

Sort, order, rank a vector

print Show value

cat Print as char

paste c() as char

round Round

apply Repeat over rows, cols

Statistical functionsrnorm, dnorm, pnorm, qnorm

Normal distribution random sample, density, cdf and quantiles

lm, glm, anova Model fitting

loess, lowess Smooth curve fitting

sample Resampling (bootstrap, permutation)

.Random.seed Random number generation

mean, median Location statistics

var, cor, cov, mad, range

Scale statistics

svd, qr, chol, eigen

Linear algebra

Graphical functionsplot Generic plot eg: scatter

points Add points

lines, abline Add lines

text, mtext Add text

legend Add a legend

axis Add axes

box Add box around all axes

par Plotting parameters (lots!)

colors, palette Use colors

Branching

if (logical expression) { statements} else { alternative statements}

else branch is optional{ } are optional with one statement

ifelse (logical expression, yes statement, no statement)

Loops

When the same or similar tasks need to be performed multiple times; for all elements of a list; for all columns of an array; etc.

for(i in 1:10) { print(i*i)}

i<-1while(i<=10) { print(i*i) i<-i+sqrt(i)}

Also: repeat, break, next

Regular ExpressionsTools for text matching and replacement which are available in similar forms in many programming languages (Perl, Unix shells, Java)

> a <- c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")

> grep("L", a)[1] 2 3 5

> grep("L", a, value=T)[1] "Ly-9" "MLN50" "CLH-17"

> grep("^L", a, value=T)[1] "Ly-9"

> grep("[0-9]", a, value=T)[1] "Ly-9" "MLN50" "ZNF191" "CLH-17"

> gsub("[0-9]", "X", a)[1] "CENP-F" "Ly-X" "MLNXX" "ZNFXXX" "CLH-XX"

Storing Data

Every R object can be stored into and restored from a file with the commands“save” and “load”.

This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MS-Windows, Unix, Mac.

> save(x, file=“x.Rdata”)> load(“x.Rdata”)

Importing and Exporting Data

There are many ways to get data in and out.

Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab-delimited text files.

> x <- read.delim(“filename.txt”)

Also: read.table, read.csv, scan

> write.table(x, file=“x.txt”, sep=“\t”)

Also: write.matrix, write

Importing Data: caveats

Type conversions: by default, the read functions try to guess and auto convert the data types of the different columns (e.g. number, factor, character). There are options as.is and colClasses to control this.

Special characters: the delimiter character (space, comma, tabulator) and the end-of-line character cannot be part of a data field. To circumvent this, text may be “quoted”. However, if this option is used (the default), then the quote characters themselves cannot be part of a data field. Except if they themselves are within quotes…Understand the conventions your input files use and set the quote options accordingly.

Bioconductor• R software project for the analysis and

comprehension of biomedical and genomic data.– Gene expression arrays (cDNA, Affymetrix)– Pathway graphs– Genome sequence data

• Started in 2001 by Robert Gentleman, Dana Farber Cancer Institute.

• About 25 core developers, at various institutions in the US and Europe.

• Tools for integrating biological metadata from the web (annotation, literature) in the analysis of experimental metadata.

• End-user and developer packages.

Example R/BioC Packages

methods Class/method tools

tools

tkWidgets

Sweave(),gui tools

marrayTools, marrayPlots

Spotted cDNA array analysis

affy Affymetrix array analysis

annotate Link microarray data to metadata on the web

mva, cluster, clust, class

Clustering and classification

t.test, prop.test, wilcox.test

Statistical tests

Recommended