Introduction to R - Massachusetts Institute of Technologybarc.wi.mit.edu/education/hot_topics/Intro_to_R_2016/... · 2010. 7. 11. · – Write commands in R and then paste into

Introduction to R: Using R for Statistics and

Data Analysis

BaRC Hot Topics

http://barc.wi.mit.edu/hot_topics/

Why use R?

• Perform inferential statistics (e.g., use a statistical test to calculate a p-value)

• Analyze numbers (vectors and matrices)

• Create custom figures

• Automate analysis routines (and make them more reproducible) – RStudio: R markdown (Rmd) and knit (html)

• Reduce copying and pasting – Some Unix commands may be easier – ask us!

• Use up-to-date analysis algorithms

• Real statisticians use it, and it’s free!

2 2

Why not use R?

• A spreadsheet application already works fine

• You’re already using another statistics package – Ex: Prism, MatLab

• It’s hard to use at first – You have to know what commands/syntax to use

• You don’t know how to get started – Irrelevant if you’re here today

3 3

About R

• Originally written by Ross Ihaka and Robert Gentleman • Written in mostly C • R is accessible from other languages (Python/Perl) • Packages – many statisticians/developers have extended R

by creating packages (libraries) containing a set of commands/code, data, documentation in a well-defined format: Comprehensive R Archive Network (CRAN) cran.r-project.org BioConductor bioconductor.org Install package: install.packages("packageName")

4

About R: object-oriented (OO) systems

• Class

• Object

• Method

5

Getting Started: Running R

1. Using Rstudio (tak.wi.mit.edu/rstudio) – Enter your tak username and password

2. On the command line (on tak) – Log into tak ssh USERNAME@tak -Y – Start R by typing R

3. On your own computer – Go to R (http://www.r-project.org/) – Download “base” from CRAN and install it on your computer – Open/install the program – Install RStudio (optional)

6 6

Start of an R session

7

RStudio Interface

8 rstudio.org

Console

Editor; R Script

Workspace; History

Output: Plots

Good practices… • Save all useful commands and rationale

– Add comments (starting with “#”)

• Reproducibility - several approaches:

– Write commands in R and then paste into a text file, or

• By convention, we end files of R commands with “.R”

• Use a specific name for file (ex: compare_WT_KO_weights.R)

– Start file with #!/usr/bin/Rscript to make it a R script

– In RStudio, save the file as R markdown (.Rmd) and knit (.html)

• Use the up-arrow to get to previous command

– Minimize typing, as this increases potential errors.

• To clear your R window, use Ctrl-L

• Names of variables should not begin with numbers or underscore, use letters (case sensitive)

– uppercase + lowercase helps (myWTmice)

– can include dots (my.WT.mice)

9 9

Getting help

• Use the Help menu • Check out “Manuals”

– http://www.r-project.org/ – contributed documentation

• Use R’s help and example – ?median [show info for cmd] – ??median [search docs] – example(median)

• Search the web – “r-project median”

• Our favorite book: – Introductory Statistics with R

(Peter Dalgard)

10

html help

Useful Commands

• dir() #list the files in the directory • getwd() #get working dir • setwd ("/nfs/BaRC_Public") #change working dir • history(n) #command history (default n=25)

savehistory(file="myfile.txt") #saves as .Rhistory

• ls() #list objects • sessionInfo() #print session info about R and loaded packages

(useful for knowing version) • quit() # quit R

Save workspace (.RData file)

• Assigning value: = or <-

– X <- 2

– X = 2

11

Data Types: Mode

• Mode: components of the same type Commonly used*:

– character (eg. "dog", "A")

– integer (eg. 3)

– numeric (eg. 3.15)

– logical (eg. TRUE, FALSE)

• Useful commands: – is.numeric(x) #returns if x is TRUE or FALSE

– as.numeric(x) #converts x to numeric

12 *incomplete

Data Types/Structure

• Vector: one-dimensional, most commonly used data structure of the same type chr_vector<-c("DNA", "RNA","Protein")

• List: one-dimensional, elements can be of any type x_list<-list(1,2,"a",c(TRUE,FALSE))

• Matrix: two-dimensional, elements of the same type x_matrix<-matrix(1:9, ncol=3,nrow=3)

• Data frame: two-dimensional, similar to a matrix but elements can be of different type df<-data.frame(x=1:4,y=c("a","b","c","d"))

13

Commands to Explore Data

head(object) # see the top of your data

length(object) # length of a vector

dim(object) # dimensions of a matrix or data frame

names(object) # (get or set) names of an object

mode(object) # check if it’s numeric, character, etc.

class(object) #class or type of an object

str(object) #structure of an object

14

Simple Workflow for a t-test

# Number of tumors (from litter 2 on 11 July 2010)

wt = c(5, 6, 7)

ko = c(8, 9, 11))

# Try default t-test settings (Welch's 2-sample t-test)

t.test(wt, ko)

# Do standard 2-sample t-test

t.test(wt, ko, var.equal=T)

# Save the results as a variable

wt.vs.ko = t.test(wt, ko, var.equal=T)

# What are the different parts of this data frame?

names(wt.vs.ko)

# Just print the p-value

wt.vs.ko$p.value

# What commands did we use?

history(max.show=Inf)

15 15

• Take R to your preferred directory

• On R GUI, go to File > Change dir…

• Check where you are (e.g., get your working directory)

and see what files are there > getwd()

[1] "X:/bell/Hot_Topics/Intro_to_R"

> dir()

[1] "compare_WT_KO_weights.R"

Reading Files

16 16

• Usually it’s easiest to read data from a file

– Organize in Excel with one-word column names

– Save as tab-delimited text

• Check that file is there dir()

• Read file tumors = read.delim("tumors_wt_ko.txt", header=T)

(Other Options: row.names=T, check.names=F)

• Check that it’s OK

Reading Data Files

17

> tumors

wt ko

1 5 8

2 6 9

3 7 11 17

Accessing Data

> tumors$wt # Use the column name

[1] 5 6 7

> tumors[1:3,1] # [rows, columns]

[1] 5 6 7

> tumors[,1] # missing row or column => all

[1] 5 6 7

> tumors[1:2,1:2] # select a submatrix

wt ko

1 5 8

2 6 9

> t.test(tumors$wt, tumors$ko) # t-test as before

18

18

> tumors

wt ko

1 5 8

2 6 9

3 7 11

Creating an output table

• Most analyses involve several outputs

• You may want to create a matrix to hold it all

• Create an empty matrix

– name rows and columns

pvals.out = matrix(data=NA, ncol=2, nrow=2)

colnames(pvals.out) = c(“two.tail", “one.tail")

rownames(pvals.out) = c("Welch", "Wilcoxon")

pvals.out two.tail one.tail

Welch NA NA

Wilcoxon NA NA

19 19

Filling the output table (matrix)

• Do the stats # Welch’s test (t-test with pooled variance)

pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value

pvals.out[1,2] = t.test(tumors$wt, tumors$ko,

alt="less")$p.value

# Wilcoxon rank sum test (non-parametric alternative to t-

test)

pvals.out[2,1] = wilcox.test(tumors$wt, tumors$ko)$p.value

pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko,

alt="less")$p.value

pvals.out

two.tail one.tail

Welch 0.04191452 0.02095726

Wilcoxon 0.10000000 0.05000000

20 20

Printing the output table

• We may want to round the p-values pvals.out.rounded = signif(pvals.out, 3)

• Print the matrix (table) write.table(pvals.out.rounded,

file="Tumor_pvals.txt", quote=F, sep="\t")

• Warning: output column names are shifted by 1 when read in Excel

21 21

Running a series of commands

1. Copy and paste commands into R session, or 2. Run a script in R, or

source("compare_WT_KO_weights.R")

[but not so useful in this case, since we aren’t creating any files]

3. Run a script from Unix terminal i) method 1 – Print output on screen R --vanilla < compare_WT_KO_weights.R

– Print output in file R --vanilla < compare_WT_KO_weights.R > R_out.txt

ii) method 2 ./compare_WT_KO_weights.R but script must have this as the first line: #!/usr/bin/Rscript

22 22

User-defined functions • Easily perform repetitive task, useful within a script. • Syntax myFunction <- function (arg1, arg2,…){

body

return(object)

}

• Example #Given a vector, calculate percent for each element

calcPercent <- function(x){

percent<- 100*x/sum(x)

return(percent)

}

x<-c(7,5,8,10) calcPercent(x) [1] 23.33333 16.66667 26.66667 33.33333

23

Introduction to figures

• R is very powerful and very flexible with its figure generation

• Any aspect of a figure should be modifiable

• Some figures aren’t available in spreadsheets

• Boxplot example

boxplot(tumors) # Simplest case

# Add some more details

boxplot(tumors, col=c("gray", "red"), main="MFG

appears to be a tumor suppressor", ylab="number of

tumors")

24 24

Boxplot description

IQR

75th percentile

median

25th percentile

<= 1.5 x IQR

Any points

beyond the

whiskers are

defined as

“outliers” Right-click to

save figure

(for R GUI)

25

RStudio

25

Figure formats and sizes

• By default, figures on tak are saved as “Rplots.pdf”

• Helpful figure names can be included in code

• To select name and size (in inches) of pdf file pdf(“tumor_boxplot.pdf”, w=11, h=8.5)

boxplot(tumors) # can have >1 page

dev.off() # tell R that we’re done with this figure

• To create another format (with size in pixels) png(“tumor_boxplot.png”, w=1800, h=1200)

boxplot(tumors)

dev.off()

26 26

Useful Packages I: tidyverse

• Originally written by Hadley Wickham

• Packages for data analysis

– dplyr

– tidyr

– readr

– ggplot2

– others…

27

Tidyverse: readr and tidyr

• readr – read in tabular data – read_tsv or read_csv data<-read_tsv("normalizedCounts_subset.txt")

• tidyr – "tidy" your data messy_data<-data.frame(

gene=c("GAPDH", "ACTA1","ZEB1"),

heart=c(100,140,450),liver=c(241,10,20)

)

gene heart liver 1 GAPDH 100 241 2 ACTA1 140 10 3 ZEB1 450 20

tidy_data<-gather(messy_data,tissue,count,heart:liver)

gene tissue count 1 GAPDH heart 100 2 ACTA1 heart 140 3 ZEB1 heart 450 4 GAPDH liver 241 5 ACTA1 liver 10 6 ZEB1 liver 20

28

Tidyverse: dplyr

29

Verb Description

filter Select rows based on criteria

select Select columns by name

arrange Reorder (rows)

mutate Create new column

summarise Summarise values

group_by Group operations

Useful Packages II: BioConductor

• Packages/tools for analysis of high-throughput genomic data

• To see all packages: library() • For searchable listing of packages: www.bioconductor.org/packages/release/BiocViews.html

• All require the package to be installed AND explicitly called (or in Rstudio, checked), for example, library(limma)

• Install what you need on your computer or, for tak, ask the IT group to install packages via

http://tak.wi.mit.edu/trac/newticket 30

30

Sampling of Popular BioConductor Packages*

• limma

• biomaRt

• Rsamtools

• edgeR

• DESeq/DESeq2

• affy

• topGO

• ShortRead

31

*Top75 listing from http://www.bioconductor.org/packages/stats/ based on download

http://www.bioconductor.org/packages/stats/



Other Useful Commands

32

library() mean() round(x, n)

dir() median() min()

length() sd() max()

dim() rbind() paste()

nrow() cbind() x[x>0]

ncol() sort() x[c(1,3,5)]

unique() rev() seq(from, to, by)

t() log(x, base) commandArgs()

32

R Resources

• R/Bioconductor short course – http://jura.wi.mit.edu/bio/education/hot_topics/

• R scripts (and commands) for Bioinformatics – http://iona.wi.mit.edu/bio/bioinfo/Rscripts/

• We’re glad to share commands and/or scripts to get you started

• BaRC scripts in \\BaRC_Public\BaRC_code\R • Rstudio Cheat Sheets:

https://www.rstudio.com/resources/cheatsheets/

33 33

Documents

Introduction to R - Massachusetts Institute of Technologybarc.wi.mit.edu/education/hot_topics/Intro_to_R_2016/... · 2010. 7. 11. · – Write commands in R and then paste into