51
Steve Pittard / BimCore [email protected] Intro to R/BioConductor January 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor January 28, 2011

Intro to R / BioConductor

Page 2: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor January 28, 2011

Intro to R / BioConductor

Contacts: Kim Gernert [email protected]

Lance Waller [email protected]

Steve Pittard [email protected]

Website: www.bimcore.emory.edu/bbseries

Page 3: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor January 28, 2011

Intro to R / BioConductor

Page 4: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Intro to R / BioConductor

“Companies as diverse as Google, Pfizer, Merck, Bank of America, and the InterContinental Hotels Group and Shell use R.”

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

“ R has really become the second language for people coming out of grad school now, and there’s an amazing amount of code being written for it,” said Max Kuhn, associate director of nonclinical statistics at Pfizer.

“ Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially.

Page 5: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

What is R ?

R is an interactive framework for data and statistical analysis that also happens to have a programming language.

Compare this to languages such as Perl, Python, and Java that have data analysis addons

Which language to use ? Exploit the strengths of all of them but if data analysis is a big part of the work then consider using R as part of the “pipeline”.

R can be well integrated with Excel, existing C,C++, Perl, Python, XML, and FORTRAN programs. See www.omegahat.com

Most of the effort in using R relates to shaping the data for analysis and understanding the available functions and packages - deep programming skills are not required but are definitely rewarded.

Page 6: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Where to Find Software & Documentation

Project Home page: http://www.r-project.org/

Comprehensive R Archive Network: http://cran.cnr.berkeley.edu/

Manuals: http://cran.cnr.berkeley.edu/manuals.html

Good Intro Manual: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

FAQs: http://cran.cnr.berkeley.edu/faqs.html

The R Journal: http://journal.r-project.org/

Contributed Documentation: http://cran.cnr.berkeley.edu/other-docs.html

BioConductor: http://www.bioconductor.org (400+ addons for life scientists)

R Graphical User Interface (GUI): http://sciviews.org/_rgui

R Programming Jobs: http://www.programmingr.com/category/stype/r-job-listings

Spreadsheet Addiction: http://www.burnsstat.com/pages/Tutor/spreadsheet_addiction.html

Page 7: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Additional Resources - Books/Courses

R in a Nutshell - Jospeh Adler (2009)

Data Manipulation with R - Phil Spector (2008)

Ggplot2: Elegant Graphics for Data Analyses

A Handbook of Statistical Analyses Using R - Everitt and Hothorn (2006)

BioConductor Case Studies - Hahne, Huber, Gentleman, Falcon (2008)

Introductory Statistics with R - Peter Dalgaard

Software for Data Analysis: Programming with R - John Chambers (2008)

BioConductor Courses & Conferences: www.bioconductor.org/help/events/

Page 8: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Additional Resources - Books/Courses

R in a Nutshell - Jospeh Adler (2009)

Data Manipulation with R - Phil Spector (2008)

Ggplot2: Elegant Graphics for Data Analyses

A Handbook of Statistical Analyses Using R - Everitt and Hothorn (2006)

BioConductor Case Studies - Hahne, Huber, Gentleman, Falcon (2008)

Introductory Statistics with R - Peter Dalgaard

Software for Data Analysis: Programming with R - John Chambers (2008)

BioConductor Courses & Conferences: www.bioconductor.org/help/events/

Page 9: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Additional Resources - CRAN

To obtain information on the wide variety of addons available for then visit http://cran.cnr.berkeley.edu Here are some of the available package

categories. Use the “search” function to drill down:

Page 10: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Additional Resources - CRAN

To obtain information on the wide variety of addons available for then visit http://cran.cnr.berkeley.edu Here are some of the available package

categories. Use the “search” function to drill down:

Page 11: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Clash of The Titans - R vs MATLAB vs SAS

Let the debate begin….. But there are no easy answers here. There are lots of good things about all three.

Page 12: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

To Use A Graphical Interface or Not ?

R is provided with a command line interface (CLI), which is the preferred user interface for power users because it allows direct control on calculations and is flexible. However, good knowledge of the language is required. CLI is thus intimidating for beginners. Some common R GUIs include:

R Commander, JGR, SciViews-R, R.app (OSX), ECLIPSE with STATET, EMACs

Page 13: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Commerically Supported R ?

Revolution Analytics (www.revolutionanalytics.com) claims to provide an enhanced version of R with additional features such as a Visual IDE & Debugger and on-call technical support both for end-user questions and statistical consultation. I have no practical experience with this product and can’t say if its good or not. However for those seeking production support for R this is one possibility.

Page 14: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Style Guides for R Programming

Take some “style” hints from the big guys: See “Google’s R Style Guide” and Hadley’s Style Guide at:

http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html

https://github.com/hadley/devtools/wiki/Style

Ever try reading code written by someone else ?

Page 15: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - The Work Flow

The typical R workflow is:

Data Cleanup/Reshaping --> Model/Analyze/Statistics --> Visualize

Repeat until it makes sense. Data Cleanup can frequently require more time than the second two combined ! Beware. It takes TIME to get used to transforming data. Data sets rarely arrive at your door in the ideal format. Spend time with the R input transformation functions.

Page 16: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Your First R Session

$ R> help.start()> ?mean #Get help on the mean function> example(kmeans) #Run an example of kemans (if it exists)> ??clust #Search for info on clust> x <- rnorm(50)> y <- rnorm(x) #Generate two pseudo-random normal vectors > plot(x, y) #Plot the points in the plane. > ls() #See R objects that are in the R workspace.> x <- 1:20 #Make x = (1, 2, ..., 20).> w <- 1 + sqrt(x)/2> dummy <- data.frame(x=x, y= x + rnorm(x)*w)> fm <- lm(y ~ x, data=dummy)> q() Save workspace image? [y/n/c]:y$

Page 17: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R As a “Calculator”

> x <- 1000> 3+log10(x)[1] 6> m <- 2.0> x <- c(1,2,3,4,5,6,7)> y <- c(2,4,6,8,10,12,14)> mean(y)[1] 8

> x*y #vector multiplication[1] 2 8 18 32 50 72 98

> sqrt(var(x))[1] 2.160247

# Prob of 3 coincident birthdays in 150 people

> pbirthday(150,coincident=3)

[1] 0.9848007

> library(multipol)> a=as.multipol(c(1,2,3))

> b=as.multipol(c(4,5,6))

> a

x^0 x^1 x^2

1 2 3

> b

x^0 x^1 x^2

4 5 6

> polyprod(a,b)

y^0 y^1 y^2

x^0 4 5 6

x^1 8 10 12

x^2 12 15 18

Page 18: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Reading Data Into R

>Biomarkers <- read.csv(“bio.csv”, header=TRUE,sep=“,”) >library(ODBC) >sheet <- “c:\\Documents and Settings\\user\sheet.xls” >con <- odbcConnectExcel(sheet)

Can also use library(XML) to read MS Office docs since they are zipped XML files.

>library(RMySQL); drv <- dbDriver(“MySQL”)

>con <- dbConnect(drv,dbname=,user,password=,host=)

>mydata <- dbGetQuery(con, “SELECT * FROM mydata”)

>wpage <- readLines("http://www.programr.com/list.html") >author_lines <- wpage[grep("<I>", wpage)]

Page 19: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Reading/Writing Data with R

>library(R2HTML) >dir.create(file.path(tempdir(),"R2HTML"))>tgt <- HTMLInitFile(file.path(tempdir(),"R2HTML"),filename="sample", BackGroundColor="#BBBBEE")HTML("<br>This is a test,file=target)>tmp <- as.data.frame(matrix(rnorm(100),ncol=10))>HTML(tmp,file=tgt)>HTMLEndFile()

>load(“My_Saved_WorkSpace.Rdata”) # Binary format >save(x,y,z,file=“precious_data.rda”) # Binary format >write.csv(x,file=“co2.txt”,sep=“,”)

# CSV, text

Page 20: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Reading/Writing Data with R

Function(s) Purpose

read.epinfo Read saved objects from EpiInfo

read.xport Read saved objects in SAS export format

read.spss Read saved objects from SPSS written using the save or export command

read.systat Read saved objects from SYSTAT rectangular (mtype=1) data only

read.dta Read saved objects from STATA (versions 5-9)

read.mtp Read Minitab Portable Worksheet Files

read.octave Read saved objects from GNU octave

read.dbf Read or write saved objects from DBF files (FoxPro, dBase,etc)

Page 21: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Functions

Functional Programming: “Computations are based around functions, which can encapsulate and return specific, meaningful results” *

> rnorm(10) [1] 0.005746913 0.137718058 -0.350398360 -1.258142029 -1.549439542 [6] 0.390550679 -0.368581797 1.050090167 -0.351906139 -1.273418525> avg <- mean(rnorm(10))> avg[1] -0.1479458> quantile(rnorm(10)) 0% 25% 50% 75% 100% -1.1462175 0.3044227 0.4740667 0.8506002 1.3871263 >

* John Chambers, Software for Data Analysis: Programming with R

Page 22: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Objects

Everything, (vector, factor, matrix, array, list , data.frame), is an object, which also has a type and belongs to a class:

> 3+5[1] 8> typeof(3)[1] "double"> class(3)[1] "numeric"> typeof(`+`)[1] "builtin"> > as.Date("12/11/10", "%m/%d/%y") + 1[1] "2010-12-12">

Page 23: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Vectors

R implements vectors quite well. It is a fundamental object type so get familiar with stashing data in vectors. They can be combined into matrices

> X <- c(1,3,2,10,5) #create a vector x with 5 components> Y <- 1:5 #create a vector of consecutive integers> y+2 #scalar addition [1] 3 4 5 6 7> 2*y #scalar multiplication [1] 2 4 6 8 10> y^2 #raise each component to the second power [1] 1 4 9 16 25> X <- c(1,3,2,10,5); y<-1:5 #two or more statements are separated by semicolons> x*y[1] 1 6 6 40 25> diff(x)[1] 2 -1 8 -5> max(x)[1] 10>x[3:5] #returns elements 3 to 5 inclusively[1] 2 10 5> x>3 #Returns a locical vector[1] FALSE FALSE FALSE TRUE TRUE> x[x>3] # Returns those elements greater than 3[1] 10 5

Page 24: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Vectors

> seq(10) [1] 1 2 3 4 5 6 7 8 9 10> seq(0,1,length=10) [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 [8] 0.7777778 0.8888889 1.0000000> seq(0,1,by=0.1) [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0> rep(1,3)[1] 1 1 1> c(rep(1,3),rep(2,2),rep(-1,4))[1] 1 1 1 2 2 -1 -1 -1 -1> g <- sample(c(12,17,24),50,prob=c(20,30,50),replace=TRUE)> g [1] 17 24 17 24 24 24 17 24 12 17 24 12 24 17 24 17 24 24 24 24 17 17 12 17 24[26] 24 24 24 24 12 17 24 17 17 12 24 12 17 12 17 24 17 17 24 24 24 12 24 24 >length(g[g>=24])[1] 25

Page 25: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Matrices

R also supports matrices, which are objects that typically refer to a numeric array of rows and columns. An easy way to create a matrix is to combine vectors of equal length using the “cbind()” function (column bind). > x[1] 1 3 2 10 5> y[1] 1 2 3 4 5> mymat <- cbind(x,y); mymat # Create the matrix x y[1,] 1 1[2,] 3 2[3,] 2 3[4,] 10 4[5,] 5 5> t(mymat) # Transpose it [,1] [,2] [,3] [,4] [,5]x 1 3 2 10 5y 1 2 3 4 5> mymat[2,] # Get the second row of mymatx y 3 2

Page 26: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023 0

Concepts for Using R - Matrices

R also supports matrices, which are objects that typically refer to a numeric array of rows and columns. An easy way to create a matrix is to combine vectors of equal length using the “cbind()” function (column bind).

> ex1 <- matrix(1:4,ncol=2) # Use the “matrix” command> ex2 <- matrix(c(20,40,40,40),ncol=2)> > ex1 [,1] [,2][1,] 1 3[2,] 2 4> ex2 [,1] [,2][1,] 20 40[2,] 40 40> solve(ex1) # Find the inverse [,1] [,2][1,] -2 1.5[2,] 1 -0.5> diag(ex2) # Get the diagonal of ex2[1] 20 40>

Page 27: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Data Frames

A data frame is an object that contains data in a format that allows for easier manipulation, reshaping, and open-ended analysis.

Data frames are tightly coupled collections of variables. It is one of the more important constructs you will encounter when using R so learn all you can about it.

> head(clinical.trial) patient age treatment center1 1 67.28203 Treatment Center E2 2 65.74676 Treatment Center B3 3 72.69997 Treatment Center A4 4 58.33269 Treatment Center E5 5 67.31442 Treatment Center B

> clinical.trial$age> head(clinical.trial$age)[1] 67.28203 65.74676 72.69997 58.33269 67.31442 70.84661

Page 28: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Data Frames

> info <- data.frame(gender = c("M", "M", "F"), ht = c(172, 186.5, 165), wt = c(91, 99, 74))

> info gender ht wt1 M 172.0 912 M 186.5 993 F 165.0 74> info[1,2][1] 172> names(info)[1] "gender" "ht" "wt" > info$ht[1] 172.0 186.5 165.0> rownames(info) <- c("S1","S2","S3")> info gender ht wtS1 M 172.0 91S2 M 186.5 99S3 F 165.0 74

Page 29: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Concepts for Using R - Data Frames

> height <- info$ht> height1] 172.0 186.5 165.0

> info$age = c(28,55,43)

> info gender ht wt ageS1 M 172.0 91 28S2 M 186.5 99 55S3 F 165.0 74 43

> subset(info,age > 50 ) gender ht wt ageS2 M 186.5 99 55

>

Page 30: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - The Work Flow

The typical R workflow is:

Data Cleanup/Reshaping --> Model/Analyze/Statistics --> Visualize

Repeat until it makes sense. Data Cleanup can frequently require more time than the second two combined ! Beware. It takes TIME to get used to transforming data. Data sets rarely arrive at your door in the ideal format. Spend time with the R input transformation functions.

Page 31: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data CleanUp

Data Cleanup/Reshaping: cbind, rbind, merge, cut> top.salaries name.last name.first team position salary1 Manning Peyton Colts QB 187000002 Brady Tom Patrots QB 1462677203 Pepper Julius Panthers DE 141375004 Palmer Carson BEngals QB 13980000

>year <- c(2008,2008,2008,2008)

>rank <- c(1,2,3,4)

>more.cols <- data.frame(year,rank)

>cbind(top.salaries,more.cols) name.last name.first team position salary year rank1 Manning Peyton Colts QB 18700000 2008 12 Brady Tom Patrots QB 146267720 2008 23 Pepper Julius Panthers DE 14137500 2008 34 Palmer Carson BEngals QB 13980000 2008 4

Page 32: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data CleanUp

Data Cleanup/Reshaping: cbind, rbind, merge, cut

> next.three name.last name.first team position salary1 Favre Brett Packers QB 128999992 Bailey Champ Broncos CB 126900503 Harrison Marvin Colts WR 12000000

> rbind(top.salaries,next.three) name.last name.first team position salary1 Manning Peyton Colts QB 187000002 Brady Tom Patrots QB 1462677203 Pepper Julius Panthers DE 141375004 Palmer Carson BEngals QB 139800005 Favre Brett Packers QB 128999996 Bailey Champ Broncos CB 126900507 Harrison Marvin Colts WR 12000000

Page 33: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data CleanUp

Data Cleanup/Reshaping: rbind,cbind, merge, cut:> tb1 indiv_id SNP1 SNP21 1 1 12 2 1 13 3 0 04 4 1 0> tb2 indiv_id cov1 cov21 1 1.14 74.62 3 4.50 79.43 4 0.80 48.24 6 1.39 68.1> merge(tb1,tb2,by="indiv_id",all=TRUE) indiv_id SNP1 SNP2 cov1 cov21 1 1 1 1.14 74.62 2 1 1 NA NA3 3 0 0 4.50 79.44 4 1 0 0.80 48.25 6 NA NA 1.39 68.1

Page 34: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data CleanUp

Data Cleanup/Reshaping: rbind,cbind, merge, cut:

> clinical.trial patient age year.enroll1 1 47.41745 19952 2 64.92090 19893 3 49.66592 19874 4 68.23656 1985..100 100 49.45632 1988

> c1 <- cut(clinical.trial$age, breaks = 4)> table(c1)c1 (41.1,50] (50,58.8] (58.8,67.6] (67.6,76.4] 9 34 41 16

Page 35: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data Modeling

Data Modeling: Linear Regression

>ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)>trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)>group <- gl(2,10,20, labels=c("Ctl","Trt"))>weight <- c(ctl, trt)>m.D9 <- lm(weight ~ group)

>summary(resid(m.D9)) Min. 1st Qu. Median Mean 3rd Qu. Max. -1.071e+00 -4.938e-01 6.850e-02 2.498e-17 2.462e-01 1.369e+00 >opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))>coef(m.D9)(Intercept) groupTrt

5.032 -0.371

>plot(m.D9, las = 1)

Page 36: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data Modeling

The typical R workflow is to 1) clean, massage, and reshape data, 2) model it, and 3) visualize it:

Page 37: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Data Modeling

Data Modeling: K-Means Clustering

> x <- rbind(matrix(rnorm(1000, sd = 0.3), ncol = 2), matrix(rnorm(1000, mean = 1, sd = 0.3), ncol = 2)) # Create two populations

> colnames(x) <- c("x", "y")> cl <- kmeans(x, 2)> plot(x, col = cl$cluster, main=“Rigged Example”)> c1 <- kmeans(x, 4)

Page 38: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Graphics

> fuel row.names Weight Disp. Mileage Fuel Type1 Eagle.Summit.4 2560 0.97 33 3.030303 Small2 Ford.Escort.4 2345 114.00 33 3.030303 Small3 Ford.Festiva.4 1845 0.81 37 2.702703 Small4 Honda.Civic.4 2260 0.91 32 3.125000 Small5 Mazda.Protege.4 2440 113.00 32 3.125000 Small6 Mercury.Tracer.4 2285 0.97 26 3.846154 Small..59 Nissan.Axxess.4 3185 146.00 20 5.000000 Van60 Nissan.Van.4 3690 146.00 19 5.263158 Van

> names(fuel)[1] "row.names" "Weight" "Disp." "Mileage" "Fuel" "Type"

> hist(fuel$Mileage,col="green", border="black",main="R Histogram Demo",breaks=25,xlab="Auto Fuel Mileage")

Page 39: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Graphics

>hist(fuel$Mileage,col="green", border="black",main="R Histogram Demo",breaks=25,xlab="Auto Fuel Mileage")

Page 40: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Graphics

>boxplot(Weight~Mileage,data=fuel,main="Box Plot + Example",xlab="Mileage",ylab="Weight in lbs")

Page 41: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

R - Graphics

> plot(fuel$Fuel, type="l")> plot(fuel$Fuel, type="b"); title("both")> plot(fuel$Fuel, type="o"); title("overstruck")> plot(fuel$Fuel, type="h"); title("high density")

Page 42: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

ggplot2 - A Better Graphics Package ?

>ggplot(diamonds, aes(depth, fill = cut)) + geom_density(alpha = 0.2) + xlim(55, 70)

Page 43: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

BioConductor

Page 44: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

Affymetrix data is distributed in .CEL files. The Bioconductor “affy” package has methods to read these files and do some intelligent processing on them. In particular the “ReadAffy()” method.

>library(affy); library(CLL) #Read in some example data from CLL >data(CLLbatch) #Read in some example data from CLL

> CLLBatch AffyBatch objectsize of arrays=640x640 features (91215 kb)cdf=HG_U95Av2 (12625 affyids)number of samples=24number of genes=12625annotation=hgu95av2notes=

>sampleNames(CLLbatch)[1] "CLL10.CEL" "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" "CLL15.CEL"[7] "CLL16.CEL" "CLL17.CEL" "CLL18.CEL" "CLL19.CEL" "CLL1.CEL" "CLL20.CEL"[13] "CLL21.CEL" "CLL22.CEL" "CLL23.CEL" "CLL24.CEL" "CLL2.CEL" "CLL3.CEL"

[19] "CLL4.CEL" "CLL5.CEL" "CLL6.CEL" "CLL7.CEL" "CLL8.CEL" "CLL9.CEL"

Page 45: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

> class(CLLbatch)[1] "AffyBatch"attr(,"package")[1] "affy”

> slotNames(CLLbatch) [1] "cdfName" "nrow" "ncol" [4] "assayData" "phenoData" "featureData" [7] "experimentData" "annotation" "protocolData"

> CLLbatch@annotation[1] "hgu95av2”> CLLbatch@phenoDataAn object of class "AnnotatedDataFrame" sampleNames: CLL10.CEL, CLL11.CEL, ..., CLL9.CEL (24 total) varLabels and varMetadata description: sample: arbitrary numbering

> CLLbatch@featureDataAn object of class "AnnotatedDataFrame" featureNames: 1, 2, ..., 409600 (409600 total) varLabels and varMetadata description: none

Page 46: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

> CLLbatch[1,1:3]AffyBatch objectsize of arrays=640x640 features (24012 kb)cdf=HG_U95Av2 (12625 affyids)number of samples=3number of genes=12625annotation=hgu95av2notes=> CLLbatch[1,1:3]@assayData # Look at the expression data$exprs CLL10.CEL CLL11.CEL CLL12.CEL1 183.0 113.0 119.02 11524.8 6879.8 7891.3..

> library(affyQCReport) # Load up some QC methods> saqc = qc(CLLbatch)> slotNames(saqc)[1] "scale.factors" "target" "percent.present" [4] "average.background" "minimum.background" "maximum.background" [7] "spikes" "qc.probes" "bioBCalls" [10] "arraytype"

Page 47: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

> plot(saqc) # Look at the Quality Control Plot

Page 48: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

> dd = dist2(log2(exprs(CLLbatch)))> dd.row = as.dendrogram(hclust(as.dist(dd)))> plot(dd.row)

Determine if one (or more) arrays are significantly different from the others. Computer the median of the absolute values of differences between pairs of arrays

Page 49: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

> library(arrayPLM) # Load up some more diagnostics> sampleNames(CLLbatch) = sub("\\.CEL$", "", sampleNames(CLLbatch))> dataPLM = fitPLM(CLLbatch)# Plot NUSE - Normalized Unscaled Standard Error> boxplot(dataPLM, main=“NUSE”,ylim=c(0.95,1.22), outline=FALSE, las=3,

whisklty=0, staplety=0)

Page 50: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

An Example - Affymetrix Data

# Now plot RLE - Relative Log Expression

>Mbox(dataPLM, ylim=c(-0.4,0.4), main="RUSE",col=”red",las=3,whisklty=0,staplelty=0)

Page 51: Steve Pittard / BimCore wsp@emory.eduIntro to R/BioConductorJanuary 28, 2011 Intro to R / BioConductor

Steve Pittard / BimCore [email protected] Intro to R/BioConductor April 19, 2023

Acknowledgements

The following sources were used in building this presentation:

* BioConductor Case Studies - Hahne, Huber, Gentleman, Falcon (2008)

* “An Interactive Introduction To R (Programming Language For Statistics)” http://www.slideshare.net/dataspora/an-interactive-introduction-to-r-programming-language-for-statistics

* Blogistic Reflections: http://blogisticreflections.wordpress.com/page/2/

* Statistical Computing with R: A Tutorial

http://math.illinoisstate.edu/dhkim/rstuff/rtutor.html