Basic Introduction to R and Bioconductor Aedin Culhane May 23, 2011 ([email protected])
R (www.r-project.org) is an open source interactive computer system for visualization
and analysis of statistical data. Bioconductor (www.bioconductor.org) is a project within
R. The Bioconductor project is a source of many resources relevant to data analysis for
high-throughput biology.
Installing R and Bioconductor
Download R from http://www.r-project.org. The primary portal for R is
http://cran.r-project.org. Many local mirrors exist and these may provide quicker downloads.
R can be downloaded precompiled for Linux, Mac OS and Windows. If you are using
Linux, R can also be installed with the distribution's package manager (for example, yum install).
To install R, simply click on download and select the version appropriate for your
operating system.
For most users, this precompiled version of R is sufficient. However, if you wish to
write an extension package for R/Bioconductor, you need to install R from source.
Please refer to the R FAQ for directions on this. Windows users will need to install Unix
tools, Perl and LaTeX; please refer to the excellent guide at
http://www.murdoch-sutherland.com/Rtools/.
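Once R is installed, a quick check from the R prompt confirms which version is running. This is a minimal sketch using only base R; the version threshold "2.13.0" is an illustrative assumption, not a requirement stated in this guide.

```r
# Print the full version string of the running R installation
print(R.version.string)

# Compare the running version against a minimum
# (the threshold is an illustrative assumption)
min_version <- "2.13.0"
is_recent <- getRversion() >= min_version
print(is_recent)
```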
R Extension Packages
To find out about contributed packages, go to http://cran.r-project.org/ (or the mirror
you are using) and click on the link Contributed extension packages at the bottom of the
main download page. This will link to a page with lots of information on contributed
packages and to “CRAN Task Views”.
CRAN task views group contributed packages into over 25 categories, which makes
installing related packages simpler.
For example, to install the genetics packages, type
install.packages("ctv")
library(ctv)
install.views("Genetics")
Note that these categories contain many packages; for example, the Genetics view
contains 29 packages.
Installing Individual Contributed or Extension Packages
Although a team of statisticians and programmers maintain R, many independent groups
submit contributed packages. This is a very extensive resource of hundreds of programs,
which are not installed by default.
R extensions can be installed easily. R extensions are stored in repositories. Select the
package repositories using Packages -> Select repositories
The R repositories are:
CRAN: basic R distribution
CRAN (extras): Contributed R packages. There are several hundred.
Bioconductor: The Bioconductor Project produces an open source software framework
that will assist biologists and statisticians working in bioinformatics, with primary
emphasis on inference using DNA microarrays. A CRAN style R package repository is
available via http://www.bioconductor.org/.
Omegahat: The Omegahat Project for Statistical Computing provides a variety of open-
source software for statistical applications, with special emphasis on web-based software,
Java, the Java virtual machine, and distributed computing. More recently, many other
plugins have become available, including RGoogleStorage, RGoogleDocs, ROpenOffice,
RAmazonDBRest, RAmazonS3, R2GoogleMaps etc. A CRAN style R package repository
is available via http://www.omegahat.org/
Select a repository. You may be offered a selection of CRAN mirrors; select one close by.
You can now download extension packages.
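The same repository choice can be made from the command line. A minimal sketch, assuming the CRAN mirror https://cloud.r-project.org (any mirror URL can be substituted):

```r
# Set the default CRAN repository for this session
# (the mirror URL is an assumption; substitute one close to you)
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Show the repositories that install.packages() will now use
current_repos <- getOption("repos")
print(current_repos)

# Interactive alternatives (these open a menu, so they are left commented out):
# setRepositories()    # choose among CRAN, Bioconductor, Omegahat, ...
# chooseCRANmirror()   # pick a CRAN mirror from a list
```

Setting the option in a script avoids the menu prompts, which is convenient when installation is automated.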
There are several ways to install extension packages:
1. From the GUI, select Packages -> Install Packages. Note that only the packages
from the selected repositories will be shown.
2. From the command line, type install.packages("packagename")
3. Download a compressed copy of the package from the R website and install from
the local .zip or .tar.gz file
Alternatively, one can install a package and specify the repository using the following
command within R:
install.packages("packagename", repos = "http://www.omegahat.org/R")
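When re-running an installation, it can save time to install only the packages that are not yet present. A minimal sketch in base R; the helper name find_missing is hypothetical, and the package names are only examples.

```r
# Return the packages in `pkgs` that are not in `installed`
# (find_missing is a hypothetical helper, not part of R)
find_missing <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

wanted <- c("ade4", "scatterplot3d", "e1071")
to_install <- find_missing(wanted)
print(to_install)

# Install only what is missing (commented out because it needs network access):
# if (length(to_install) > 0) install.packages(to_install)
```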
Installing Bioconductor
The recommended method to install Bioconductor is using the following script, which
can be sourced directly:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
This will automatically download the core Bioconductor packages. Due to the number and
size of Bioconductor packages, many useful packages will not be installed by default;
these can be installed individually as described above.
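The biocLite.R script was the installer when this guide was written. In current Bioconductor releases it has been replaced by the BiocManager package from CRAN, so on a recent R installation the equivalent steps look roughly like the sketch below. The wrapper name install_bioc is hypothetical; BiocManager::install() is the real function.

```r
# install_bioc is a hypothetical wrapper around the BiocManager installer
install_bioc <- function(pkgs = character()) {
  # Install BiocManager from CRAN if it is not already available
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # With no package names, this initialises or updates the Bioconductor
  # installation; with names, it installs those packages and their dependencies
  BiocManager::install(pkgs)
}

# Usage (commented out because it needs network access):
# install_bioc()          # set up or update Bioconductor
# install_bioc("made4")   # install an individual package
```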
Using a script to simplify Installation of R and Bioconductor
Once you are familiar with R and Bioconductor, you may wish to script the above
process so that updates to R are simpler ;-)
I use the following script, which I saved to a plain text file called bioC_install.R
### Packages for R.
install.packages("ade4")
install.packages("scatterplot3d")
install.packages("RCurl")
install.packages("DBI")
install.packages("RMySQL")
install.packages("impute")
install.packages("fastICA")
install.packages("e1071")
### BioC Packages for R.
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite("made4")
biocLite("hgu133plus2.db")
biocLite("hgu133plus2cdf")
biocLite("hgu133plus2probe")
biocLite("Heatplus")
biocLite("biomaRt")
To install these packages, I simply type the command
source("./bioC_install.R")
and then go for coffee ;-)
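After the script has finished, it is worth confirming that everything installed correctly. A minimal sketch; check_installed is a hypothetical helper, and the package list should be replaced by the names in your own script.

```r
# check_installed is a hypothetical helper: it returns TRUE/FALSE per package,
# depending on whether the package's namespace can be loaded
check_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# "stats" and "utils" ship with R; replace with e.g. c("ade4", "made4", "biomaRt")
status <- check_installed(c("stats", "utils"))
print(status)

# Names of any packages that failed to load
print(names(status)[!status])
```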
R and Bioconductor are pretty “intelligent”: if you install a package that has
dependencies (needs other packages), these are installed automatically.
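To see which packages would be pulled in, the dependencies of a package can be listed before installing it. A minimal sketch using tools::package_dependencies(); here it is run against the locally installed packages so that it works offline, and "stats" is used only as an example.

```r
# Build a package database from what is installed locally.
# To query packages you have not installed yet, use db = available.packages()
# instead (this needs network access).
db <- installed.packages()

# List the packages that "stats" depends on or imports
deps <- tools::package_dependencies("stats", db = db,
                                    which = c("Depends", "Imports"))
print(deps)
```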
Introduction to Bioconductor Packages
Bioconductor (http://www.bioconductor.org) has a really nice “task views” interface to
the packages in Bioconductor. To find out more about these, click on the Packages link, or
click Install, which will give you a link to BiocViews.
BiocViews are grouped into 3 categories:
Annotation Data
Experiment Data
Software
Click on each of these links to explore the packages in Bioconductor.
Bioconductor Software Packages
Software packages are subdivided into categories.
Each contains a long list of contributed packages. For example, clicking on
Bioinformatics -> PreProcessing returns a long list of packages with methods for
handling many different data types,
and each package page gives details, help and a link to a vignette (tutorial).
Bioconductor Annotation Data Packages
Almost 600 Bioconductor packages are annotation packages. These are an incredibly
valuable resource in Bioconductor. These packages provide annotation on the genes on
microarrays.
This resource can be searched by organism, chip manufacturer, chip name etc.
For example, click on chip manufacturer -> AffymetrixChip or Organism -> homo
sapiens to get a list of relevant annotation packages.
Each Affymetrix array is represented by 3 annotation packages: .db, cdf and probe. For
example, the GeneChip Human Genome U133 Plus 2.0 Array has the following packages:
hgu133plus2.db: annotation data (chip hgu133plus2) assembled using data from
public repositories
hgu133plus2cdf: chip description file (cdf)
hgu133plus2probe: probe sequence data (probe)
The latter two are compiled from Affymetrix or manufacturer information.
To install an annotation package, follow the same guidelines as for package installation:
source("http://www.bioconductor.org/biocLite.R")
biocLite("hgu133plus2.db")
To find out about a package:
> library(hgu133plus2.db)
> hgu133plus2()
Quality control information for hgu133plus2:
This package has the following mappings:
hgu133plus2ACCNUM has 54675 mapped keys (of 54675 keys)
hgu133plus2ALIAS2PROBE has 73685 mapped keys (of 110538 keys)
hgu133plus2CHR has 40772 mapped keys (of 54675 keys)
hgu133plus2CHRLENGTHS has 93 mapped keys (of 93 keys)
hgu133plus2CHRLOC has 39212 mapped keys (of 54675 keys)
hgu133plus2CHRLOCEND has 39212 mapped keys (of 54675 keys)
hgu133plus2ENSEMBL has 37294 mapped keys (of 54675 keys)
hgu133plus2ENSEMBL2PROBE has 18042 mapped keys (of 19887 keys)
hgu133plus2ENTREZID has 40801 mapped keys (of 54675 keys)
hgu133plus2ENZYME has 4616 mapped keys (of 54675 keys)
hgu133plus2ENZYME2PROBE has 928 mapped keys (of 936 keys)
hgu133plus2GENENAME has 40801 mapped keys (of 54675 keys)
hgu133plus2GO has 35250 mapped keys (of 54675 keys)
hgu133plus2GO2ALLPROBES has 13288 mapped keys (of 13360 keys)
hgu133plus2GO2PROBE has 10091 mapped keys (of 10161 keys)
hgu133plus2MAP has 40571 mapped keys (of 54675 keys)
hgu133plus2OMIM has 27795 mapped keys (of 54675 keys)
hgu133plus2PATH has 11297 mapped keys (of 54675 keys)
hgu133plus2PATH2PROBE has 214 mapped keys (of 214 keys)
hgu133plus2PFAM has 39581 mapped keys (of 54675 keys)
hgu133plus2PMID has 40160 mapped keys (of 54675 keys)
hgu133plus2PMID2PROBE has 276342 mapped keys (of 283543 keys)
hgu133plus2PROSITE has 39581 mapped keys (of 54675 keys)
hgu133plus2REFSEQ has 40248 mapped keys (of 54675 keys)
hgu133plus2SYMBOL has 40801 mapped keys (of 54675 keys)
hgu133plus2UNIGENE has 40632 mapped keys (of 54675 keys)
hgu133plus2UNIPROT has 37123 mapped keys (of 54675 keys)
Additional Information about this package:
DB schema: HUMANCHIP_DB
DB schema version: 2.1
Organism: Homo sapiens
Date for NCBI data: 2010-Sep7
Date for GO data: 20100904
Date for KEGG data: 2010-Sep7
Date for Golden Path data: 2010-Mar22
Date for IPI data: 2010-Aug19
Date for Ensembl data: 2010-Aug5
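The "mapped keys" lines report how many of the 54675 probe sets on the array have an annotation of each type. The sketch below works out the coverage of the SYMBOL mapping from the numbers printed above, and then shows how such a mapping can be queried. The probe set IDs are illustrative examples, and the lookup only runs if hgu133plus2.db is installed.

```r
# Coverage of the SYMBOL mapping, using the counts from the output above
mapped <- 40801
total <- 54675
coverage <- round(100 * mapped / total, 1)
print(coverage)   # percentage of probe sets with a gene symbol

# Look up gene symbols for a few probe sets (example IDs, an assumption)
probes <- c("1007_s_at", "1053_at")
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library(hgu133plus2.db)
  # mget() returns a list with one entry per probe set; NA if unmapped
  symbols <- mget(probes, hgu133plus2SYMBOL, ifnotfound = NA)
  print(unlist(symbols))
}
```

The other mappings listed above (for example hgu133plus2ENTREZID or hgu133plus2GO) can be queried in the same way.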
Bioconductor Experiment Data Packages
Experiment data packages contain published data pre-prepared for Bioconductor. For
example, these packages include data from leukemia gene expression studies (Golub et
al., 1999), colon cancer gene expression profiling (Alon et al., 1999), etc. Many of the
tutorials (vignettes) in Bioconductor use these data in exercises.
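As an illustration, the leukemia data of Golub et al. are distributed in the golubEsets experiment data package. A minimal sketch, assuming golubEsets is installed (for example with biocLite("golubEsets")); the code is skipped otherwise.

```r
# Check whether the experiment data package is available (assumption: it may not be)
have_golub <- requireNamespace("golubEsets", quietly = TRUE)

if (have_golub) {
  library(golubEsets)
  # Load the combined training and test samples
  data(Golub_Merge)
  # Golub_Merge is an ExpressionSet: print a summary and the matrix dimensions
  print(Golub_Merge)
  print(dim(Biobase::exprs(Golub_Merge)))
}
```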
Bioconductor and R Help
The first place to find help on R and Bioconductor is their websites. These provide an
excellent source of information.
In particular, it is worth examining the manuals, FAQ and searchable archives of the
mailing lists. Do subscribe to the Bioconductor and R mailing lists. When
subscribing, select the “daily digest” option, as these mailing lists are busy with >20 emails a
day and could quickly fill up your inbox. The daily digest option sends 1 email a
day, which contains all the correspondence from that day.
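R also has built-in help that works offline. The calls below are all part of base R; the topic names are just examples.

```r
# Open the help page for a function (same as ?install.packages)
help("install.packages")

# Search installed help pages by keyword
help.search("repositories")

# Find functions on the search path whose name starts with "install"
hits <- apropos("^install")
print(hits)

# List the vignettes (tutorials) shipped with installed packages
vignette()
```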
R Help
My favorite beginner's guide to R is by Emmanuel Paradis. It is available from
http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
An introduction to R classes and objects is available at http://cran.r-project.org/doc/manuals/R-intro.html
Tom Short’s R reference card (http://cran.r-project.org/doc/contrib/Short-refcard.pdf) and
the other contributed documents at http://cran.r-project.org/other-docs.html are useful.
I have taught several R courses (Bio503, HSPH) and the course notes are online. I hope
you find them useful also. The 2011 course is available at
http://isites.harvard.edu/icb/icb.do?keyword=k76038 but I will link the most recent
course from my HSPH homepage http://www.hsph.harvard.edu/research/aedin-culhane/
Bioconductor Help
I found the Bioconductor course and workshop examples very useful when I was
learning. These are available at http://www.bioconductor.org/workshops
For more information on getting started in R and Bioconductor, please see Vince Carey’s
guide at http://bosbioc.wordpress.com/
Thomas Girke (UC Riverside) has also written excellent online tutorials on R and
Bioconductor, available from http://manuals.bioinformatics.ucr.edu/
Interfaces for R
I have used all of the interfaces below at one stage or another. All of them color R code,
highlight matching brackets, and are useful.
RStudio (my current favorite)
Cross platform- available on Windows, Linux, MacOS etc
http://www.rstudio.org/
Tinn-R
Only available on Windows
http://www.sciviews.org/Tinn-R/
NotePad++ (with the plugin NpptoR)
Only available on Windows
http://notepad-plus-plus.org/
http://sourceforge.net/projects/npptor/
On Linux I have also used Kate the KDE text editor which has inbuilt highlighting of R
code. On Mac, many users simply use the standard R GUI which is pretty good and has
some features that are better than the version that comes on Windows/Linux.