8/10/2019 r Programming Life Sciences Aug 2009
R Programming for Life Scientists
Version 2.0
Raymond R. Balise, Ph.D.
Health Research and Policy
Spectrum
Roadmap
What makes R different from the rest?
Setting up R
Types of data
Working with collections of data
Importing and exporting data
Writing functions
Graphics
When to Use R
Shoestring budget
Cutting edge statistics
Developing your own or fine-tuning existing methods
Local expertise
Programming Languages
Procedural languages
    C, Fortran, Cobol, Basic
    use a model where the logic flows from the top of the page to the bottom with calls to goto subroutines as needed
    It is hard to encapsulate the code.
Object oriented languages
    C++, Visual Basic, JAVA
    involves creating objects and then operating on them
R is Object Oriented (OO)
You create objects
    vector of numbers, a graphic, etc.
You call methods/functions to operate on the objects.
Working with an OO language requires you to learn about special methods to create, access, modify, or destroy objects and their properties.
    R hides these processes.
    It helps a lot if you want to write new statistics and methods and is required for making new packages.
OO Example
With R you write code in the editor, which I will show you in a minute.
You can create an object which holds a bunch of numbers (a vector, if you remember math).
You can then use (aka call) a function (aka method) to operate on the object.
    The summary() function
        Create and display a numeric summary object
    The plot() function
        Create and display a graphic summary object
Make the ages object.
Call the summary function.
Call the plot function.
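The three steps on this slide can be sketched in code. This is a minimal version that assumes the ages values used later in these slides; ageSummary is just an illustrative name for the saved summary object.

```r
# Make the ages object: a numeric vector
ages = c(9, 11, 40, 41)

# Call the summary function: creates and displays a numeric summary object
ageSummary = summary(ages)
ageSummary

# Call the plot function: draws each value against its position in the vector
plot(ages)
```

Saving the summary into its own object shows that the output of a function is itself an object you can keep and reuse.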
But wait, there's more!
There is a lot of functionality built into R. It ships with libraries that do many different tasks. And you can download more.
Map most of the USA.
Activate the map datasets and functions.
But hold on. There is MORE!
You can add options to the function calls to make them do fancy things like color.
Or you can have one function act on the output of another function.
And you can save output as objects!
Important Objects
Vectors are lists of numbers.
Data frames are like database tables or spreadsheets.
Where to Get R
R has two main websites.
One describes the project: http://www.r-project.org/
The other has most of the stuff you want to download: http://cran.r-project.org/
Because the R project has people working all over the globe, the software download site is mirrored everywhere. The closest mirror is USA CA1 (aka UC Berkeley).
http://cran.cnr.berkeley.edu/
There is an R installer for all the common operating systems:
    cran.cnr.berkeley.edu/bin/windows/base/
    cran.cnr.berkeley.edu/bin/macosx/
    cran.cnr.berkeley.edu/bin/linux/
Each is basically self-explanatory.
Installing on Windows
Double click the installer and just push Next until you get to this screen.
Specify that you want to do customized startup.
This will let you set up R to work with other programs nicely.
Customize
Use these options, then hit Next> a bunch.
Type help.start() and push enter to start the help.
Type q() and push enter to quit, but don't yet.
GUI
Use the built-in editor.
Save or restore all the objects in use.
Save or reload the code from the console.
Keep all the text in the console for the session.
Set the working directory to save objects.
GUI
Edit existing data.
Tweak the appearance of the console.
Rprofile.site
If you have instructions that you always want run when R starts up, you can include them in the Rprofile.site file:
GUI
Common commands.
Show the add-on packages currently accessible.
Packages in R
User-supplied packages are typically found at one of three places:
    CRAN for all kinds of stuff
    Omegahat for web-based statistics
    Bioconductor for genomic analysis
R packages update often.
Your colleagues will recommend task-specific packages. Rcmdr is my favorite.
GUI
Use a previously downloaded package. I type library(name) instead.
USA (CA1) is closest to Stanford.
Choose which set of packages to look at.
See the HUGE list of packages.
Update often!
GUI
This is useful.
HTML help
This is useful but not Google.
This will not find information if you have not installed the packages.
Rseek.org is Google-driven
I highly recommend it.
Mac Quick Help
Search help for the word "map".
Search for details on a function if the package is loaded and you know the function's name.
Windows Quick Help
Search help for the word "map".
Search help for the function named "map".
Load the package.
Mac Install
Download and double click the dmg file.
Click customize and make sure Tcl/Tk is checked on.
X11
Some packages for R on the Mac (like Rcmdr) require X11 to be installed.
I think it is part of the standard Leopard installation but was an option with Tiger. If you need it, try to install it off of the DVD that came with your machine because people have reported problems using the dmg files from Apple.com.
X11 and Add-on Packages
To get add-on packages, use this menu.
You can click here to make sure X11 works.
Getting or Updating Packages
Click Get List, click the package name, be sure install dependencies is checked on, then click install.
Instead of Point and Click
You can also run this code to have Mac or Windows R download a list of packages:
    usefulPackages = c("car", "foreign", "hexbin", "gdata", "ggplot2", "gmodels", "gplots", "Hmisc", "reshape", "Rcmdr")
    install.packages(usefulPackages, dependencies = TRUE)
Be sure to take note of any packages that do not install.
marray, affy, Biobase, Rgraphviz were not available.
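One way to take note of what did not install is to compare the list against what R reports as installed. This is only a sketch; installedNames and notInstalled are names I made up for the illustration.

```r
# Same list as in the install.packages() call
usefulPackages = c("car", "foreign", "hexbin", "gdata", "ggplot2",
                   "gmodels", "gplots", "Hmisc", "reshape", "Rcmdr")

# Names of every package currently installed in your library paths
installedNames = rownames(installed.packages())

# Packages from the list that are still missing
notInstalled = setdiff(usefulPackages, installedNames)
notInstalled
```

An empty result, character(0), means everything on the list is available.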
Your First Package
I suggest you install the Rcmdr package first thing.
Use the Install packages option on the package menu to download Rcmdr.
To make it available for your R session type:
    library(Rcmdr)
CAPITALIZATION MATTERS!
The first time you run it, it will ask you if it can download additional packages.
If you are on Windows you can directly import Excel.
On a Mac you cannot directly import from Excel.
Hate Typing?
Tab is your friend. It will auto-complete if it can or give you a list of functions that match what you have typed. It works very well on the Mac. In Windows sometimes you need to type tab twice.
In Windows, if you type tab after a ( it displays options for the function; on the Mac they just appear.
Data Set Objects
Vectors
    A bunch of data in a single row or column
    All of the same type
Matrix
    A row and column arrangement of data
    All of the same type
Data frame
    A row and column arrangement of data
    Columns can be of different types
    Like a good spreadsheet or relational database file
List
    Very free-form structure
    A grouping of different types of data
Types of Data Vectors
Numeric
    Integer, real, and complex are different types but you will not need to pay attention to the details
    NA means missing
    NaN means not a number
String
    Characters of the alphabet
Logical
    TRUE, FALSE or NA
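The difference between NA and NaN is easy to see in a small example. This is a sketch with a made-up vector x, not something from the original slides.

```r
x = c(1, NA, 0/0)   # 0/0 evaluates to NaN

is.na(x)    # FALSE TRUE TRUE: is.na() is TRUE for both NA and NaN
is.nan(x)   # FALSE FALSE TRUE: is.nan() is TRUE only for NaN

mean(x)                 # missing, because missing values propagate
mean(x, na.rm = TRUE)   # 1, after dropping both NA and NaN
```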
Making Vectors With c()
c stands for concatenate
    ages = c(9, 11, 40, 41) ; ages
    stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges
Getting Details
You can use is functions and length to get details on a vector.
    is.vector(ages)
    is.numeric(ages)
    is.logical(ages)
    length(ages)
Recycling and Vectorizing
You can add one to all four ages.
    ages + c(1,1,1,1)
If you provide the scalar integer, R will temporarily vectorize the 1 by recycling that value to match the length of the ages vector.
    ages + 1
It will recycle a series also.
    ages
    ages + c(1,2)
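Here is what recycling produces for the ages vector used in these slides. The last line is my own addition to show what happens when the lengths do not divide evenly.

```r
ages = c(9, 11, 40, 41)

ages + 1          # 10 12 41 42: the 1 is recycled four times
ages + c(1, 2)    # 10 13 41 43: the pair 1, 2 is recycled twice

# A length that does not divide evenly still recycles, but R gives a warning
ages + c(1, 2, 3) # 10 13 43 42
```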
Naming Parts of a Vector
You can assign names to the elements of a vector. This allows later access to the elements using the names instead of the position.
    names(ages) = stooges
    ages
To erase them:
    names(ages) = NULL; ages
Notice what happens when the lengths differ:
    stooges = c("Larry", "Moe", "Curly")
    names(ages) = stooges
    ages
Attributes
When you add names to things (objects) they acquire or change their names attribute.
    attributes(ages)
When you strip off the names, the vector is left with no attributes.
    names(ages) = NULL
    attributes(ages)
Complex Objects
A data frame is an object with many attributes.
R ships with a lot of datasets if you want one:
    help.start()
Click packages then datasets.
    esoph
    ?esoph
    attributes(esoph)
Getting at Parts of a Vector
Specify the element number.
    heyMoe = ages[2] ; heyMoe
Use negative numbers to drop every element except the one you want.
    ages[c(-1, -3, -4)]
Specify a logical vector of TRUE and FALSE values.
    ages[c(FALSE, TRUE, FALSE, FALSE)]
Getting Parts with Names
    ages = c(9, 11, 40, 41) ; ages
    names(ages) = c("Larry", "Moe", "Curly", "Shemp")
    ages
Specify the name.
    heyMoe = ages["Moe"]
Duplicate Names
That code only returns the first one if there are duplicates.
    names(ages)[4] = "Moe"
    ages
    heyMoe = ages["Moe"]
    heyMoe
This gives all of them if there are duplicates:
    names(ages) %in% "Moe"
    ages[names(ages) %in% "Moe"]
Parts of a Data Frame
You can select columns of a data frame just like you selected elements from a vector.
    booze = esoph["alcgp"]
    is.data.frame(booze)
    esoph[2]
    esoph[c(4,5)]
Choosing Records
If you put a single item or series inside of the square brackets, R thinks you are requesting columns.
If you want to get access to specific rows, you include a comma after the rows.
    blah[rows, columns]
    esoph[ 1 , ]
    esoph[ 1:3 , ]
Smarter Access to a Vector
You can use logic checks to find the record numbers in a vector which meet your criteria.
    ages < 21
    which(ages < 21)
You can then subset down your data to the records of interest using the [ ] subset operator.
    ages[which(ages < 21)]
    ages[ages < 21]
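The two forms give the same answer for ages, but they differ once the vector has a missing value. This sketch uses a made-up vector, agesWithNA, to show the difference.

```r
agesWithNA = c(9, NA, 40, 41)

agesWithNA < 21                      # TRUE NA FALSE FALSE
agesWithNA[agesWithNA < 21]          # 9 NA: the NA comes along for the ride
agesWithNA[which(agesWithNA < 21)]   # 9: which() keeps only the TRUE positions
```

If your data can contain NA, the which() form is usually the safer habit.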
Subset a Data Frame
Recall that you can select rows with frameName[rows,columns] and if you do not include a comma, all records are chosen.
which(esoph$ncases > 0) gives you a list of records which adhere to that rule. Therefore, the code below gives you a subset:
    esoph[ which(esoph$ncases > 0) , ]
or
    esoph[esoph$ncases > 0 , ]
Choosing Values
If you need specific values, you can use the & (and) or the | (or) operators to get the ordered set of TRUE and FALSE values.
    ages > 21 & ages < 41
! means not
    !(ages > 21 & ages < 41)
Notice that it is applying the one logic check to the vector of ages. How does it do that?
Math on Data Frame Columns
You have seen how to do scalar and vector algebra. Algebra on a data frame is easy.
    names(esoph)
    esoph$total = esoph$ncases + esoph$ncontrols
To see the end of the data frame, use tail()
    tail(esoph)
Comparing Against Vectors
What happens when you try to compare a vector to a set of things?
    gender = c(NA, "Male", "Female", "Blue", "Female")
    gender == "Male" | gender == "Female"
    gender == c("Male", "Female")   # this one uses recycling and gives wrong answers
R recycles the shorter vector to be the longer length, then does the comparison. Use the %in% operator if you want to compare as if you wrote a series of or statements.
    gender %in% c("Male", "Female")
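To see exactly how the recycled comparison goes wrong, here are both results side by side. The names byRecycling and byIn are mine, and I wrapped the first comparison in suppressWarnings() only to keep the output quiet.

```r
gender = c(NA, "Male", "Female", "Blue", "Female")

# Recycling compares position by position against Male, Female, Male, ...
# (R also warns because 5 is not a multiple of 2)
byRecycling = suppressWarnings(gender == c("Male", "Female"))
byRecycling    # NA FALSE FALSE FALSE FALSE

# %in% asks whether each element matches any value in the set
byIn = gender %in% c("Male", "Female")
byIn           # FALSE TRUE TRUE FALSE TRUE
```

Notice that %in% also turns the NA into FALSE rather than passing it through.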
Categorical Variables
R makes a distinction between variables holding a bunch of characters from the alphabet and variables holding categorical information. If you have a classification/categorical variable, you want R to treat it as a factor or an ordered factor. Typical factors are treatment or gender.
    dose = c("low", "placebo", "high", "low")
    dose
    typeof(dose)
Factors
To convert a character variable to a factor, use the as.factor function.
    doseF = as.factor(dose)
    typeof(doseF)
    class(doseF)
Behind the scenes, the character variable is converted into numbers and the numbers are given character strings to display.
In modern R the levels of the factor are ordered alphabetically and the first one is represented with the digit 1, the second is 2, etc.
There are is. or as. predicate functions to check object types or convert between types of objects.
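You can look at the numbers behind the scenes yourself. The ordered factor at the end is my own illustration of how to pick the level order instead of taking the alphabetical default.

```r
dose = c("low", "placebo", "high", "low")
doseF = as.factor(dose)

levels(doseF)       # "high" "low" "placebo" (alphabetical)
as.integer(doseF)   # 2 3 1 2: the code stored for each element

# Choosing the order yourself (an illustration, not from the slides)
doseOrdered = factor(dose, levels = c("placebo", "low", "high"), ordered = TRUE)
doseOrdered < "high"   # TRUE TRUE FALSE TRUE
```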
Comparing Factors
You can compare a factor vs. a constant value.
    doseF == "high"
    as.integer(doseF) == 1
Or you can compare vs. vectors (CAREFULLY).
    doseF == c("high", "low")   # notice the wrong answer thanks to recycling
    doseF %in% c("high", "low")
R will stop you from comparing factors that have different categories.
    doseF2 = as.factor(c("blah", "placebo", "high", "low"))
    doseF == doseF2
Recoding Factors
Often you will want to regroup factor levels.
    amount = as.factor(c("placebo", "10mg", "5mg", "10mg"))
    levels(amount)
    regroup = list(none="placebo", some=c("5mg", "10mg"))
    levels(amount) = regroup
    amount
[Diagram: placebo maps to none; 5mg and 10mg map to some.]
Numeric Factors
If you have numeric factors, be careful converting from factors back to numbers.
    ID = c(1000, 1000, 1001, 2)
    IDf = factor(ID)
    as.integer(IDf)
    levels(IDf)
    numbersAgain = as.numeric(levels(IDf))[IDf]
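Here is what each of those calls returns, plus a second conversion that I find easier to read. The as.character() route is my addition; it gives the same numbers.

```r
ID = c(1000, 1000, 1001, 2)
IDf = factor(ID)

levels(IDf)       # "2" "1000" "1001"
as.integer(IDf)   # 2 2 3 1: the internal codes, not the IDs

# Two equivalent ways back to the original numbers
as.numeric(levels(IDf))[IDf]     # 1000 1000 1001 2
as.numeric(as.character(IDf))    # 1000 1000 1001 2
```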
Easier Recoding
Other packages like car have functions to recode:
    library(car)
    newAge2 = recode(ages, ' 1:21="Young"; else= "Old" ')
    newAge2
    detach("package:car")
Attaching Data Frames
People who really hate typing attach data frames so they can refer to them with short names.
    bmi = women$weight / women$height ^2 * 703
Instead you can attach the women data frame and write an easier formula:
    attach(women)
    search()
    bmi = weight / height ^2 * 703; bmi
    detach(women)
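If you would rather not change the search path at all, with() is another option that also saves typing. This is my suggestion, not something from the original slides.

```r
# with() evaluates the expression using the columns of women,
# without attaching anything to the search path
bmi = with(women, weight / height^2 * 703)
head(bmi)
```

Because nothing is attached, there is nothing to remember to detach afterwards.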
Keeping Track
You can see what datasets are in each of the work environments/packages with the list stuff function ls().
    rm(list=ls(all=TRUE))
    ls()
    search()
    attach(women)
    ls()
    search()
    head(women); head(height)
[Diagram: the search path. R looks 1st in .GlobalEnv, 2nd in the attached women data frame (height, weight), and 3rd in the datasets package (women).]
Adding a Variable & Making a DF
    women$bmi = weight / height ^2 * 703; bmi
    head(women)
    ls()
[Diagram: the search path again. .GlobalEnv (look 1st) now holds a women data frame with bmi; the attached women (look 2nd) and the women in datasets (look 3rd) are the data frame without bmi.]
Making a Data Frame
Frequently you will want to make data frames for analysis with Rcmdr. Use the data.frame() command:
    attach(sleep)
    pair = data.frame(extra[group=="A"], extra[group=="B"])
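A caution if you try this with the copy of sleep that ships with R: as far as I know its groups are labeled 1 and 2, not A and B, so check the levels first. The sketch below assumes that built-in copy; the column names drug1 and drug2 are only illustrative.

```r
# Built-in sleep data: extra (hours of sleep gained) and group (drug 1 or 2)
levels(sleep$group)   # "1" "2"

# One row per patient, one column per drug
pair = data.frame(drug1 = sleep$extra[sleep$group == "1"],
                  drug2 = sleep$extra[sleep$group == "2"])
head(pair)
```

Naming the columns inside data.frame() also gives you tidier variable names to pick from in Rcmdr.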
Using Rcmdr for a paired t-test
Click here.
Loading Text Data into R
Reading text files:
    fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header = TRUE)
See if it worked:
    fakeAlleles
    names(fakeAlleles)
    summary(fakeAlleles)
    fakeAlleles$dude = as.character(fakeAlleles$dude)
    fakeAlleles
A better option:
    fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header = TRUE, colClasses = c("character", "factor", "factor"))
Other Text Formats
Other text reading methods:
    read.csv = comma-separated values
    read.csv2 = semicolon-delimited files
    read.delim = tab-delimited files
    read.fwf = fixed-width format files
Use the same options as read.table.
If the data has bad or no column headings you may also want to include:
    read.table(stuff, col.names = c("name1", "name2"))
To prevent characters from coming in as factors:
    options(stringsAsFactors = FALSE)
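You can try these readers without a file on disk by handing them a string through the text argument. The data below is made up for the illustration; only the column layout mirrors the fakeAlleles example.

```r
# Hypothetical file contents, held in a string so the example is self-contained
csvText = "dude,allele1,allele2
Al,A,A
Bob,A,T
Cy,T,T"

# Same colClasses idea as with read.table
alleles = read.csv(text = csvText,
                   colClasses = c("character", "factor", "factor"))
sapply(alleles, class)
```

Setting colClasses up front means you do not have to convert columns after the import.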
Data Frames
The data imported into a data frame.
    class(fakeAlleles)
A data frame really is a list of vectors where the vectors are all the same length.
    as.list(fakeAlleles)
To select a column you specify the data frame $ variable name.
    theDudes = fakeAlleles$dude
All the stuff you saw for logic checks on vectors can be used on the parts of a data frame.
    fakeAlleles$allele1 == "A"
Subsetting Vectors (again)
Recall that you can subset using the [ ] operator:
    ages = c(9, 11, 40, 41)
    heyMoe = ages[2]
    ages
Subsetting Data Frames
Parts (subsets) of data frames are referenced by "row numbers comma column numbers":
    The first record: fakeAlleles[1, ]
    The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
    The genotype for record 6: fakeAlleles[6, c(2,3)]
or by names:
    fakeAlleles[, c("allele1", "allele2")]
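Since the fakeAlleles file is not included here, this sketch builds a small stand-in with made-up records so you can try the row and column positions yourself. The variable homozygousA is also just an illustrative name.

```r
# Hypothetical stand-in for fakeAlleles
fakeAlleles = data.frame(dude    = c("Al", "Bob", "Cy"),
                         allele1 = c("A", "A", "T"),
                         allele2 = c("A", "T", "T"),
                         stringsAsFactors = FALSE)

fakeAlleles[1, ]                          # rows first: record 1, every column
fakeAlleles[ , c(2, 3)]                   # columns second: every record, columns 2 and 3
fakeAlleles[2, c("allele1", "allele2")]   # one record's genotype, by name

# A logic check in the row position keeps only the matching records
homozygousA = fakeAlleles[fakeAlleles$allele1 == "A" &
                          fakeAlleles$allele2 == "A", ]
homozygousA$dude                          # "Al"
```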
Getting Counts with Rcmdr
Subsetting Using Logic
You can use logic checks to subset:
    fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A"
    fakeAlleles[ fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A", ]
Importing From Excel
If you have Perl on your machine, you can use the read.xls() function in the gdata library to easily get data out of Excel and into a data frame.
    Mac has Perl.
    Windows: http://www.activestate.com/activeperl/
Using read.xls
Windows:
    library(gdata)
    sleepy = read.xls("c:\\blah\\sleep.xls")
Mac:
    library(gdata)
    read.xls("/users/balise/desktop/sleep.xls")
It's that easy. Behind the scenes it is converting the xls file into a csv so you can use the text importing options.
Do summary() on the data frame and notice what happens to the missing value.
RODBC
ODBC is a language/convention for accessing databases. R allows you to use ODBC connections to burrow directly into databases and other data containers like Excel.
    library(RODBC)
    channel
SQL
If you have to learn one programming language, learn SQL.
    With it you can manipulate data stored in nearly every commercial database.
    You can aggregate, subset and modify data.
    It is well implemented inside of both R and SAS.
SQL with R is nicely documented in Spector's (2008) Data Manipulation with R. It is a must-own for people who want to learn R.
Exporting Text Files
R can write objects full of data, including data frames, into text files.
By default, it will quote the character strings and fill in the letters NA where there were originally missing values.
This code exports back to the original appearance.
    write.table(sleepy, file = "c:\\blah\\exported.tab", sep = "\t", quote = FALSE, na = "")
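To see what those options do without touching your own files, here is a sketch that writes a tiny made-up data frame to a temporary file and reads the raw lines back. The row.names = FALSE option is my addition; without it R also writes a column of row names.

```r
# Hypothetical data with one missing value
sleepy = data.frame(patient = c("p1", "p2"), extra = c(0.7, NA))

outFile = tempfile(fileext = ".tab")
write.table(sleepy, file = outFile, sep = "\t",
            quote = FALSE, na = "", row.names = FALSE)

# A header line, then one tab-separated line per record;
# the missing value is written as an empty field
readLines(outFile)
```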
Office 2007 Excel
ODBC connection
1. Open Control Panel.
2. Double click Administrative Tools.
3. Double click Data Sources (ODBC).
4. On the User DSN tab choose Add.
5. Click Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb).
6. Give the connection a name and browse to the file.
Jot down the name of the connection for the R code.
Using an ODBC connection
Once the ODBC connection is set up, use code like this:
    library(RODBC)
    connection = odbcConnect("sleepODBC")
    dataFromODBC = sqlFetch(connection, "Sheet1")
    odbcClose(connection)
Creating Programs
You can write line-by-line instructions in the R console, use the editors built into R, or use a third-party editor (like Tinn-R for Windows or JGR).
Console
    Type history() to see the lines you have submitted recently and then save to a file and re-run it later if needed.
Built-in Editor
    Mac: Click the blank page at top of the console
    Windows: File > New Script
Windows Tinn-R Editor
http://www.sciviews.org/Tinn-R/index.html
OO Programming in R
OO programming requires:
    objects
    classes: describe specific properties for groups of objects
    inheritance: classes related to each other (derived from other classes) have related properties
    polymorphism: the same function name applied to different classes does different things
R vs. JAVA: R typically has separate classes for actions instead of bundling them with the data structures.
    JAVA: Animal -> domesticated -> dog (walks)
    R: Animal -> domesticated -> dog
       Movement -> Walks
Polymorphism is Fun
plot does different things depending on the function arguments:
    plot(sleepy)
    plot(sleepy$extra)
    plot(sleepy$baseline, sleepy$extra, sleepy$group)
Take a look at how it works:
    isS4(plot)
    methods(plot)
    getAnywhere(plot.factor)
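You can build the same kind of polymorphism yourself. This is a minimal sketch of the mechanism I believe plot relies on (a generic that dispatches on the class of its first argument); describe and its methods are hypothetical names made up for the illustration.

```r
# A generic function dispatches on the class of its first argument
describe = function(x, ...) UseMethod("describe")

# One method per class, named generic.class
describe.numeric = function(x, ...) paste("numeric with mean", mean(x))
describe.factor  = function(x, ...) paste("factor with", nlevels(x), "levels")
describe.default = function(x, ...) paste("object of class", class(x)[1])

describe(c(1, 2, 3))             # "numeric with mean 2"
describe(factor(c("a", "b")))    # "factor with 2 levels"
describe("text")                 # "object of class character"
```

The same function name does different things for different classes, which is why methods(plot) lists so many entries.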
Writing Functions
You can easily write functions, but notice that only the last thing calculated is returned:
    MandM = function(x){mean(x); median(x)}
    MandM(sleepy$extra) # returns only the median
Store the values you want into a list:
    MandM = function(x) {
        blah = list(theMean=0, theMedian=0)
        blah$theMean = mean(x)
        blah$theMedian = median(x)
        return (blah)
    }
    MandM(sleepy$extra)
Other Arguments
    MandM(sleepy$baseline)
It points out that we need to deal with missing values.
Look up mean and median and you will see they allow the na.rm parameter to determine if missing values are dropped.
Using MandM(sleepy$baseline, na.rm=TRUE) does not work because the parameter list does not allow it. We want to allow that parameter to be passed along. So rewrite the function.
M and M Again
Recall that ... in the argument list means "other stuff".
    MandM = function(x, ...) {
        blah = list(theMean=0, theMedian=0)
        blah$theMean = mean(x, ...)
        blah$theMedian = median(x, ...)
        return (blah)
    }
    MandM(sleepy$extra)
    MandM(sleepy$baseline, na.rm=TRUE)
Applying Your Function
R does allow you to write loops to iterate over records or variables but, if you are not writing novel math functions, they can generally be avoided.
R will try to vectorize and process:
    MandM(c(sleepy$baseline, sleepy$extra))
Use sapply to apply a function to a data frame:
    sapply(sleepy, MandM, na.rm=TRUE)
Better M and M
    MandM = function(x, ...) {
        blah = list(theMean=0, theMedian=0)
        if(is.numeric(x) == TRUE) {
            blah$theMean = mean(x, ...)
            blah$theMedian = median(x, ...)
        }
        return (blah)
    }
    sapply(sleepy, MandM, na.rm=TRUE)
Yummy M and M
    MandM = function(x, ...) {
        blah = list(theMean=NaN, theMedian=NaN)
        if(is.numeric(x) == TRUE) {
            blah$theMean = mean(x, ...)
            blah$theMedian = median(x, ...)
        }
        return (blah)
    }
    MandM(sleepy$extra)
    MandM(sleepy$baseline, na.rm=TRUE)
    sapply(sleepy, MandM, na.rm=TRUE)
Writing Novel Functions
Look hard on rseek.org before you reinvent the wheel.
R syntax is very similar to C.
    Select/Case logic is different (R short-circuits).
The R Book by Crawley is too big to buy for just this topic but it is good for syntax. Get it from the library and read the early chapters.
The final chapter of Spector has a few wonderful pages.
Destroying Efficiency
A matrix of data is really a vector with row and column attributes added to it. This has profound speed issues if you add to the size of a matrix because the data has to be shifted all over the place.
If you plan on writing your own functions to manipulate matrices, build an empty matrix of the maximum size (or guess bigger) rather than using the functions to add rows or columns.
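The two approaches look like this side by side. It is a toy sketch with made-up names (grown, filled) and a small made-up calculation; wrap each loop in system.time() on your own machine if you want to see the speed difference.

```r
nRows = 1000

# Slow pattern: the matrix is copied every time a row is added
grown = NULL
for (i in 1:nRows) {
  grown = rbind(grown, c(i, i^2))
}

# Faster pattern: build the full-size matrix once, then fill it in
filled = matrix(NA_real_, nrow = nRows, ncol = 2)
for (i in 1:nRows) {
  filled[i, ] = c(i, i^2)
}

all(grown == filled)   # TRUE: same values, less copying
```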
Writing Efficient Code
R has decent tools for profiling code.
The Rprof and summaryRprof functions will help you figure out what is bogging down your code.
    Rprof()
    MandM(rnorm(1000000))
    Rprof(NULL)
    summaryRprof()
Debugging in R
See Chapter 9 in Gentleman's book.
The browser() function can be put inside a function to pause execution and see what is going on.
The codetools package is great for tweaking big functions:
    findLocals(), findGlobals(): show you if variables and functions originate inside of a function
    checkUsage() and checkUsagePackage(): show you what variables are modified or not touched in a function
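As a small taste of codetools, this sketch asks findGlobals() which names a function picks up from outside itself. The function addOffset and the variable offsetValue are hypothetical, made up for the illustration, and it assumes the codetools package is installed (it normally comes with R).

```r
library(codetools)

# Hypothetical function that quietly relies on a global variable
addOffset = function(x) {
  shifted = x + offsetValue   # offsetValue is not defined inside the function
  shifted
}

# merge = FALSE splits the globals into functions and variables
globals = findGlobals(addOffset, merge = FALSE)
globals$variables   # "offsetValue"
```

Finding a name here that you meant to be local is a common clue when a function behaves differently from one session to the next.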
Creating Graphs
Basic plots are easy but tweaking them for publications can be rough because the documentation on the function arguments is appalling.
Data Analysis and Graphics Using R by John Maindonald and John Braun is extremely useful.
There are myriad graphics built into the core of R plus more in the packages.
    addictedtor.free.fr/graphiques/thumbs.php
Test Scores
    scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE)
    rapply(scores, class)
    scores$CENTER = as.factor(scores$CENTER)
    scores$PAT = as.character(scores$PAT)
    rapply(scores, class)
    scores$isSick = ifelse(scores$SCORE > 0, 1, 0)
    library(car)
    (scores$SEV = with(scores, recode(SCORE, '0 = "None"; 1:30 = "Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BADDATA"')))
    (scores$SEV = factor(scores$SEV, levels = c("None", "Mild", "Moderate", "Severe"), ordered = TRUE))
Common Plots are Easy
    attach(scores) # to avoid typing scores$
    plot(SEV, main = "MainTitle", xlab = "xlab", ylab = "ylab")
    plot(SCORE)
    hist(SCORE)
    boxplot(SCORE)
    boxplot(SCORE ~ SEX, ylim = c(0,100))
    detach(scores)
Graphics Tweaks
[Figure: four panels. Density = dnorm (probability density vs. z), Probability = pnorm (probability vs. z), Quantiles = qnorm (quantile (Z) vs. p), Random numbers = rnorm (frequency vs. z).]
mfrow is used to set the number of rows and columns of graphics on a page.
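A four-panel page like the one on this slide can be drawn roughly as follows. This is my own reconstruction from the panel titles, so the ranges and the number of random values are guesses rather than the original code.

```r
oldPar = par(mfrow = c(2, 2))   # 2 rows and 2 columns of plots on one page

z = seq(-3, 3, by = 0.1)        # z values for the density and probability panels
p = seq(0.01, 0.99, by = 0.01)  # probabilities for the quantile panel

plot(z, dnorm(z), type = "l", main = "Density = dnorm",
     xlab = "z", ylab = "Probability density")
plot(z, pnorm(z), type = "l", main = "Probability = pnorm",
     xlab = "z", ylab = "Probability")
plot(p, qnorm(p), type = "l", main = "Quantiles = qnorm",
     xlab = "p", ylab = "Quantile (Z)")
hist(rnorm(1000), main = "Random numbers = rnorm",
     xlab = "z", ylab = "frequency")

par(oldPar)   # restore the previous layout
```

Saving the old settings and restoring them at the end keeps later plots from landing in a leftover 2 by 2 grid.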
Strip Charts for Small Datasets
    par(cex = 1.5) # big font
    with(Gad, stripchart(HAMA ~ DOSEGRP, xlab = "HAMA", pch = 16))
[Figure: strip chart of HAMA (x axis, roughly 20 to 35) for the dose groups HI, LO and PB.]
3 Languages for the Price of 1
The graphics I have shown use the classic graphic methods.
There are trellis plots from the lattice package that split the data into multiple panes automatically.
ggplot2 uses a "grammar of graphics" approach (like SPSS).
Don't play with pie!
    library(lattice)
    trellis.par.set(list(fontsize=list(points=20)))
    trellis.par.set(list(fontsize=list(text=25)))
    dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21))
[Figure: dot plot of Freq (0 to 20) for the DOSEGRP levels HI, LO and PB.]
The lattice package makes trellis graphics (I didn't make up these names!).
[Figure: lattice plot of NOx (micrograms/J) against Compression Ratio, split into panels labeled EE.]
Typical lattice plot with banding to show subsets.
Basic plot + geometric details + adding details + adding more details + yet more details
    qplot(carat, price, data = diamonds, geom = c("point", "smooth"))
Use Rcmdr (R Commander)
Rcmdr has A LOT of great graphics built into the point and click interface.
    library(Rcmdr)
Look up my short course (5 talks) covering basic statistics to see how to code many graphics.
www.stanford.edu/~balise/HowToDoBiostatistics.htm
You are Going to Need More Help
Data Manipulation with R by Spector.
    A must-have book on how to read and write data with or without SQL, manipulate data with R, aggregate data, and reshape datasets easily.
R Programming for Bioinformatics by Gentleman.
    A very good intermediate-level book on how R object-oriented programming really works.
The R Book or Statistical Computing by Crawley.
    These have nicely written intermediate-level statistics, but they are highly redundant across the two books.
Biostatistics
John Fox, the guy who made Rcmdr, is an excellent author and he provides an R-based supplement for his superb statistics book.
Spectrum
If you are doing biomedical research and have questions we are here to help.
    Study design
    Analysis plan
    Power and sample size calculation
    (Limited availability: help with SAS and R code)
med.stanford.edu/spctrm/biostatistician.html