Introduction to R: Using R for Statistics and Data Analysis
BaRC Hot Topics
http://barc.wi.mit.edu/hot_topics/
Why use R?
• Perform inferential statistics (e.g., use a statistical test to calculate a p-value)
• Analyze numbers (vectors and matrices)
• Create custom figures
• Automate analysis routines (and make them more reproducible)
– RStudio: R markdown (Rmd) and knit (html)
• Reduce copying and pasting
– Some Unix commands may be easier (ask us!)
• Use up-to-date analysis algorithms
• Real statisticians use it, and it’s free!
Why not use R?
• A spreadsheet application already works fine
• You’re already using another statistics package
– Ex: Prism, MATLAB
• It’s hard to use at first
– You have to know what commands/syntax to use
• You don’t know how to get started
– Irrelevant if you’re here today
About R
• Originally written by Ross Ihaka and Robert Gentleman
• Written mostly in C
• R is accessible from other languages (Python/Perl)
• Packages
– Many statisticians/developers have extended R by creating packages (libraries) containing a set of commands/code, data, and documentation in a well-defined format
– Comprehensive R Archive Network (CRAN): cran.r-project.org
– BioConductor: bioconductor.org
– Install package: install.packages("packageName")
About R: object-oriented (OO) systems
• Class
• Object
• Method
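The three terms can be illustrated with S3, one of the OO systems available in R. This is only a minimal sketch: the class name "mouse" and its fields are made up for illustration and are not part of the original slides.

```r
# Object: a list tagged with a class attribute
# (the class name "mouse" and its fields are hypothetical)
my_mouse <- structure(list(genotype = "KO", weight = 21.5), class = "mouse")

# Method: a function that the generic print() dispatches to
# whenever it is given an object of class "mouse"
print.mouse <- function(x, ...) {
  cat("Mouse with genotype", x$genotype, "weighing", x$weight, "g\n")
  invisible(x)
}

class(my_mouse)   # the class of the object
print(my_mouse)   # calls print.mouse() because of the class
```

Many built-in results work the same way: the value returned by t.test() has its own class and its own print method.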
Getting Started: Running R
1. Using RStudio (tak.wi.mit.edu/rstudio)
– Enter your tak username and password
2. On the command line (on tak)
– Log into tak: ssh USERNAME@tak -Y
– Start R by typing R
3. On your own computer
– Go to R (http://www.r-project.org/)
– Download “base” from CRAN and install it on your computer
– Open/install the program
– Install RStudio (optional)
Start of an R session
RStudio Interface
The RStudio window (rstudio.org) has four panes:
• Console
• Editor (R script)
• Workspace and History
• Output (plots)
Good practices…
• Save all useful commands and rationale
– Add comments (starting with “#”)
• Reproducibility - several approaches:
– Write commands in R and then paste into a text file, or
• By convention, we end files of R commands with “.R”
• Use a specific name for file (ex: compare_WT_KO_weights.R)
– Start file with #!/usr/bin/Rscript to make it an R script
– In RStudio, save the file as R markdown (.Rmd) and knit (.html)
• Use the up-arrow to get to previous command
– Minimize typing, since more typing means more potential errors.
• To clear your R window, use Ctrl-L
• Names of variables should not begin with a number or an underscore; start with a letter (names are case sensitive)
– uppercase + lowercase helps (myWTmice)
– can include dots (my.WT.mice)
Getting help
• Use the Help menu
• Check out “Manuals”
– http://www.r-project.org/
– contributed documentation
• Use R’s help and example
– ?median [show info for cmd]
– ??median [search docs]
– example(median)
• Search the web
– “r-project median”
• Our favorite book:
– Introductory Statistics with R (Peter Dalgaard)
Useful Commands
• dir() # list the files in the directory
• getwd() # get working dir
• setwd("/nfs/BaRC_Public") # change working dir
• history(n) # command history (default n=25)
• savehistory(file="myfile.txt") # save command history to myfile.txt (default file is .Rhistory)
• ls() # list objects
• sessionInfo() # print session info about R and loaded packages (useful for knowing version)
• quit() # quit R; you are asked whether to save the workspace (.RData file)
• Assigning value: = or <-
– X <- 2
– X = 2
Data Types: Mode
• Mode: components of the same type
Commonly used*:
– character (e.g., "dog", "A")
– integer (e.g., 3L; a plain 3 is stored as numeric)
– numeric (e.g., 3.15)
– logical (e.g., TRUE, FALSE)
• Useful commands:
– is.numeric(x) # returns TRUE if x is numeric, otherwise FALSE
– as.numeric(x) # converts x to numeric
*This list is incomplete.
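As a small illustration of checking and converting types, the sketch below uses a made-up character vector (not from the slides). Values that cannot be converted become NA.

```r
x <- c("3.15", "42", "dog")   # a character vector (made-up values)
mode(x)                        # "character"
is.numeric(x)                  # FALSE

# "dog" cannot be converted, so it becomes NA (R normally warns about this)
y <- suppressWarnings(as.numeric(x))
y
is.numeric(y)                  # TRUE
is.na(y)                       # FALSE FALSE TRUE

typeof(3L)                     # "integer"
typeof(3)                      # "double" (a plain number is not an integer)
```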
Data Types/Structure
• Vector: one-dimensional, elements of the same type; the most commonly used data structure
chr_vector<-c("DNA", "RNA","Protein")
• List: one-dimensional, elements can be of any type
x_list<-list(1,2,"a",c(TRUE,FALSE))
• Matrix: two-dimensional, elements of the same type
x_matrix<-matrix(1:9, ncol=3,nrow=3)
• Data frame: two-dimensional, similar to a matrix but columns can be of different types
df<-data.frame(x=1:4,y=c("a","b","c","d"))
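Each structure has its own way of pulling out elements. The following sketch recreates the four objects defined above and shows one access pattern for each; the particular elements selected are arbitrary.

```r
chr_vector <- c("DNA", "RNA", "Protein")
x_list <- list(1, 2, "a", c(TRUE, FALSE))
x_matrix <- matrix(1:9, ncol = 3, nrow = 3)
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

chr_vector[2]      # second element of a vector: "RNA"
x_list[[4]]        # double brackets extract one list element: TRUE FALSE
x_matrix[2, 3]     # row 2, column 3 (a matrix is filled column by column): 8
df$y[3]            # third value of column y
df[df$x > 2, ]     # rows where x is greater than 2
```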
Commands to Explore Data
head(object) # see the top of your data
length(object) # length of a vector
dim(object) # dimensions of a matrix or data frame
names(object) # (get or set) names of an object
mode(object) # check if it’s numeric, character, etc.
class(object) #class or type of an object
str(object) #structure of an object
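To see what these commands report, they can be tried on a small data frame. The sketch below builds one from the tumor counts used elsewhere in these slides.

```r
# Small example data frame (same values as the tumor counts in these slides)
tumors <- data.frame(wt = c(5, 6, 7), ko = c(8, 9, 11))

head(tumors, 2)    # first two rows
length(tumors$wt)  # 3
dim(tumors)        # 3 rows, 2 columns
names(tumors)      # "wt" "ko"
mode(tumors$wt)    # "numeric"
class(tumors)      # "data.frame"
str(tumors)        # compact summary of columns and their types
```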
Simple Workflow for a t-test
# Number of tumors (from litter 2 on 11 July 2010)
wt = c(5, 6, 7)
ko = c(8, 9, 11)
# Try default t-test settings (Welch's 2-sample t-test)
t.test(wt, ko)
# Do standard 2-sample t-test
t.test(wt, ko, var.equal=T)
# Save the results as a variable
wt.vs.ko = t.test(wt, ko, var.equal=T)
# What are the different parts of this result (a list of class "htest")?
names(wt.vs.ko)
# Just print the p-value
wt.vs.ko$p.value
# What commands did we use?
history(max.show=Inf)
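To make clear what the default (Welch) test computes, the p-value can be reproduced by hand from the means and variances. This is a sketch for illustration only; the intermediate variable names are made up.

```r
wt <- c(5, 6, 7)
ko <- c(8, 9, 11)

# Squared standard error of each group mean
v_wt <- var(wt) / length(wt)
v_ko <- var(ko) / length(ko)

# t statistic: difference in means divided by its standard error
t_stat <- (mean(wt) - mean(ko)) / sqrt(v_wt + v_ko)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_wt + v_ko)^2 /
  (v_wt^2 / (length(wt) - 1) + v_ko^2 / (length(ko) - 1))

# Two-tailed p-value from the t distribution
p_manual <- 2 * pt(-abs(t_stat), df = df_welch)

res <- t.test(wt, ko)
all.equal(p_manual, res$p.value)   # TRUE
```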
Reading Files
• Take R to your preferred directory
– In the R GUI, go to File > Change dir…
• Check where you are (e.g., get your working directory) and see what files are there
> getwd()
[1] "X:/bell/Hot_Topics/Intro_to_R"
> dir()
[1] "compare_WT_KO_weights.R"
Reading Data Files
• Usually it’s easiest to read data from a file
– Organize in Excel with one-word column names
– Save as tab-delimited text
• Check that file is there
dir()
• Read file
tumors = read.delim("tumors_wt_ko.txt", header=T)
(Other options: row.names=1 to use the first column as row names, check.names=F)
• Check that it’s OK
> tumors
  wt ko
1  5  8
2  6  9
3  7 11
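If the data file is not at hand, the read step can be tried on a temporary file. The sketch below writes the same three rows to a temporary tab-delimited file (the temporary path is an assumption standing in for tumors_wt_ko.txt) and reads it back.

```r
# Write a small tab-delimited file to a temporary location
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("wt\tko", "5\t8", "6\t9", "7\t11"), tmp_file)

# Read it back; header=TRUE uses the first line as column names
tumors <- read.delim(tmp_file, header = TRUE)

tumors        # print the whole table
str(tumors)   # check that both columns were read as numbers
```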
Accessing Data
> tumors$wt # Use the column name
[1] 5 6 7
> tumors[1:3,1] # [rows, columns]
[1] 5 6 7
> tumors[,1] # missing row or column => all
[1] 5 6 7
> tumors[1:2,1:2] # select a submatrix
wt ko
1 5 8
2 6 9
> t.test(tumors$wt, tumors$ko) # t-test as before
Creating an output table
• Most analyses involve several outputs
• You may want to create a matrix to hold it all
• Create an empty matrix
– name rows and columns
pvals.out = matrix(data=NA, ncol=2, nrow=2)
colnames(pvals.out) = c("two.tail", "one.tail")
rownames(pvals.out) = c("Welch", "Wilcoxon")
pvals.out
         two.tail one.tail
Welch          NA       NA
Wilcoxon       NA       NA
Filling the output table (matrix)
• Do the stats
# Welch’s test (t-test that does not assume equal variances)
pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value
pvals.out[1,2] = t.test(tumors$wt, tumors$ko, alternative="less")$p.value
# Wilcoxon rank sum test (non-parametric alternative to t-test)
pvals.out[2,1] = wilcox.test(tumors$wt, tumors$ko)$p.value
pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko, alternative="less")$p.value
pvals.out
two.tail one.tail
Welch 0.04191452 0.02095726
Wilcoxon 0.10000000 0.05000000
Printing the output table
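• We may want to round the p-values
pvals.out.rounded = signif(pvals.out, 3)
• Print the matrix (table)
write.table(pvals.out.rounded, file="Tumor_pvals.txt", quote=F, sep="\t")
• Warning: output column names are shifted by 1 when read in Excel, because the row names are written without a column header. Adding col.names=NA to write.table() writes a blank header for the row-name column.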
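The shifted header can be avoided with the col.names=NA option of write.table(). The sketch below uses a temporary output file and p-values typed in by hand (rounded from the table above), both of which are assumptions for illustration.

```r
# Rounded p-values entered by hand for illustration
pvals <- matrix(c(0.0419, 0.1, 0.021, 0.05), nrow = 2,
                dimnames = list(c("Welch", "Wilcoxon"),
                                c("two.tail", "one.tail")))

out_file <- tempfile(fileext = ".txt")

# col.names=NA adds an empty header field above the row names,
# so the column names line up with their columns in a spreadsheet
write.table(pvals, file = out_file, quote = FALSE, sep = "\t", col.names = NA)

readLines(out_file)[1]   # header line now starts with a tab

# Reading the file back, using the first column as row names
back <- read.delim(out_file, row.names = 1)
back
```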
Running a series of commands
1. Copy and paste commands into R session, or
2. Run a script in R, or
source("compare_WT_KO_weights.R")
[but not so useful in this case, since we aren’t creating any files]
3. Run a script from Unix terminal
i) method 1
– Print output on screen
R --vanilla < compare_WT_KO_weights.R
– Print output in file
R --vanilla < compare_WT_KO_weights.R > R_out.txt
ii) method 2
./compare_WT_KO_weights.R
but the script must be executable and have this as the first line:
#!/usr/bin/Rscript
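A script run from the terminal can also take its input file from the command line using commandArgs(). The following is a rough sketch of what such a script might look like; the argument handling and the fallback file name are assumptions, not the contents of the actual compare_WT_KO_weights.R.

```r
#!/usr/bin/Rscript
# Hypothetical usage: ./compare_WT_KO_weights.R tumors_wt_ko.txt

# Arguments given after the script name on the command line
args <- commandArgs(trailingOnly = TRUE)

# Fall back to a default file name if no argument was given
input_file <- if (length(args) >= 1) args[1] else "tumors_wt_ko.txt"

if (file.exists(input_file)) {
  tumors <- read.delim(input_file, header = TRUE)
  cat("Welch p-value:", t.test(tumors$wt, tumors$ko)$p.value, "\n")
} else {
  cat("Input file not found:", input_file, "\n")
}
```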
User-defined functions
• Easily perform repetitive tasks; useful within a script.
• Syntax
myFunction <- function (arg1, arg2,…){
body
return(object)
}
• Example
#Given a vector, calculate percent for each element
calcPercent <- function(x){
percent<- 100*x/sum(x)
return(percent)
}
x<-c(7,5,8,10)
calcPercent(x)
[1] 23.33333 16.66667 26.66667 33.33333
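A function becomes most useful when the same analysis is repeated many times. As a sketch, the hypothetical helper below (rowPvalue, with made-up counts for two genes) runs a Welch t-test on each row of a matrix using apply().

```r
# Hypothetical helper: Welch t-test p-value for one row of values,
# where the first n_wt values are WT and the rest are KO
rowPvalue <- function(values, n_wt) {
  wt <- values[1:n_wt]
  ko <- values[(n_wt + 1):length(values)]
  return(t.test(wt, ko)$p.value)
}

# Made-up counts: three WT columns followed by three KO columns
counts <- rbind(geneA = c(5, 6, 7, 8, 9, 11),
                geneB = c(10, 12, 11, 10, 13, 11))

# Apply the function to every row (MARGIN = 1)
pvals <- apply(counts, 1, rowPvalue, n_wt = 3)
pvals
```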
Introduction to figures
• R is very powerful and very flexible with its figure generation
• Any aspect of a figure should be modifiable
• Some figures aren’t available in spreadsheets
• Boxplot example
boxplot(tumors) # Simplest case
# Add some more details
boxplot(tumors, col=c("gray", "red"), main="MFG appears to be a tumor suppressor", ylab="number of tumors")
Boxplot description
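• Box: spans the 25th to the 75th percentile (the interquartile range, IQR), with a line at the median
• Whiskers: extend to the most extreme points within 1.5 x IQR of the box
• Any points beyond the whiskers are defined as “outliers”
• To save a figure: right-click on it (for R GUI), or use Export in the Plots pane (RStudio)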
Figure formats and sizes
• By default, figures on tak are saved as “Rplots.pdf”
• Helpful figure names can be included in code
• To select name and size (in inches) of pdf file
pdf("tumor_boxplot.pdf", w=11, h=8.5)
boxplot(tumors) # can have >1 page
dev.off() # tell R that we’re done with this figure
• To create another format (with size in pixels)
png("tumor_boxplot.png", w=1800, h=1200)
boxplot(tumors)
dev.off()
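The open, plot, close pattern can be checked end to end with a short sketch. It writes to R's temporary directory (an assumption, so nothing is left in the working directory) and spells out the width and height arguments in full.

```r
tumors <- data.frame(wt = c(5, 6, 7), ko = c(8, 9, 11))

# Write the figure to a temporary directory for this illustration
fig_file <- file.path(tempdir(), "tumor_boxplot.pdf")

pdf(fig_file, width = 11, height = 8.5)   # open the pdf device (size in inches)
boxplot(tumors, col = c("gray", "red"), ylab = "number of tumors")
dev.off()                                 # close the device so the file is written

file.exists(fig_file)                     # confirm that the file was created
```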
Useful Packages I: tidyverse
• Originally written by Hadley Wickham
• Packages for data analysis
– dplyr
– tidyr
– readr
– ggplot2
– others…
Tidyverse: readr and tidyr
• readr – read in tabular data
– read_tsv or read_csv
data<-read_tsv("normalizedCounts_subset.txt")
• tidyr – "tidy" your data
messy_data<-data.frame(
gene=c("GAPDH", "ACTA1","ZEB1"),
heart=c(100,140,450),liver=c(241,10,20)
)
   gene heart liver
1 GAPDH   100   241
2 ACTA1   140    10
3  ZEB1   450    20
tidy_data<-gather(messy_data,tissue,count,heart:liver)
   gene tissue count
1 GAPDH  heart   100
2 ACTA1  heart   140
3  ZEB1  heart   450
4 GAPDH  liver   241
5 ACTA1  liver    10
6  ZEB1  liver    20
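In recent versions of tidyr (1.0.0 and later), gather() still works but is superseded by pivot_longer(). The sketch below shows the equivalent call, assuming tidyr is installed. The rows come out grouped by gene rather than by tissue, but the content is the same.

```r
library(tidyr)   # assumes tidyr (>= 1.0.0) is installed

messy_data <- data.frame(
  gene  = c("GAPDH", "ACTA1", "ZEB1"),
  heart = c(100, 140, 450),
  liver = c(241, 10, 20)
)

# Collapse the heart and liver columns into tissue/count pairs
tidy_data <- pivot_longer(messy_data,
                          cols = heart:liver,
                          names_to = "tissue",
                          values_to = "count")
tidy_data
```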
Tidyverse: dplyr
Verb        Description
filter      Select rows based on criteria
select      Select columns by name
arrange     Reorder (rows)
mutate      Create new column
summarise   Summarise values
group_by    Group operations
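Each verb can be tried on the small gene/tissue/count table from the tidyr example. The following is a sketch assuming dplyr is installed; the cutoff of 50 and the new column name log2count are arbitrary choices for illustration.

```r
library(dplyr)   # assumes dplyr is installed

tidy_data <- data.frame(
  gene   = rep(c("GAPDH", "ACTA1", "ZEB1"), times = 2),
  tissue = rep(c("heart", "liver"), each = 3),
  count  = c(100, 140, 450, 241, 10, 20)
)

high <- filter(tidy_data, count > 50)    # keep rows with count above 50
high <- select(high, gene, count)        # keep only two columns
arrange(high, desc(count))               # sort rows, largest count first

# Add a new column computed from an existing one
logged <- mutate(tidy_data, log2count = log2(count))

# Mean count per tissue: group first, then summarise each group
by_tissue <- summarise(group_by(tidy_data, tissue), mean_count = mean(count))
by_tissue
```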
Useful Packages II: BioConductor
• Packages/tools for analysis of high-throughput genomic data
• To see all installed packages: library()
• For searchable listing of packages: www.bioconductor.org/packages/release/BiocViews.html
• All require the package to be installed AND explicitly loaded (or in RStudio, checked), for example, library(limma)
• Install what you need on your computer or, for tak, ask the IT group to install packages via http://tak.wi.mit.edu/trac/newticket
Sampling of Popular BioConductor Packages*
• limma
• biomaRt
• Rsamtools
• edgeR
• DESeq/DESeq2
• affy
• topGO
• ShortRead
*From the top 75 listing at http://www.bioconductor.org/packages/stats/, based on downloads
Other Useful Commands
library() mean() round(x, n)
dir() median() min()
length() sd() max()
dim() rbind() paste()
nrow() cbind() x[x>0]
ncol() sort() x[c(1,3,5)]
unique() rev() seq(from, to, by)
t() log(x, base) commandArgs()
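Several of these commands are easiest to understand by trying them on a small vector and matrix. The values below are made up for illustration.

```r
x <- c(4, -2, 9, 4, 0, 7)

x[x > 0]                            # keep positive values: 4 9 4 7
x[c(1, 3, 5)]                       # elements 1, 3 and 5: 4 9 0
sort(unique(x))                     # distinct values in order: -2 0 4 7 9
rev(x)                              # reverse the order
seq(from = 0, to = 1, by = 0.25)    # 0.00 0.25 0.50 0.75 1.00
round(log(8, base = 2), 1)          # log base 2 of 8, rounded: 3
paste("sample", 1:3, sep = "_")     # "sample_1" "sample_2" "sample_3"

# Combine two vectors as rows of a matrix, then transpose it
m <- rbind(wt = c(5, 6, 7), ko = c(8, 9, 11))
dim(m)                              # 2 3
t(m)                                # 3 rows, 2 columns
```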
R Resources
• R/Bioconductor short course – http://jura.wi.mit.edu/bio/education/hot_topics/
• R scripts (and commands) for Bioinformatics – http://iona.wi.mit.edu/bio/bioinfo/Rscripts/
• We’re glad to share commands and/or scripts to get you started
• BaRC scripts in \\BaRC_Public\BaRC_code\R
• RStudio Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/