244
R basics Albin Sandelin

R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Embed Size (px)

Citation preview

Page 1: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

R basics

Albin Sandelin

Page 2: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Ultra-short R introduction

Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why?

Ease of use

- click buttons, select data by hand

-you see the data in front of you

-you can do limited programming

Page 3: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Disadvantages

• Hard to handle large dataset (>100 points)

• Inflexible, few analyses available

• “Dumb-down” click button effect

• Hard to repeat analyses systematically with new data

Page 4: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

R• R is a computational environment - somewhere

between a program and a programming language• No buttons, no wizards: only a command line interface• Is a professional statistics toolset - likely has all the

analyses you will ever need in your career• Is also a programming language• Can handle large datasets• Very powerful graphics• The state-of-the-art statistics program for

bioinformatics, especially array analysis• Free, and open source!

Page 5: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

R in this course

In this course, we will use R forExploring data

Select subsets of dataPlotting data

Statistical tests on data…etc

= almost everything.It is the most important tool in this course!

Page 6: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

R is hard

R is challenging to learn:No visual cuesVery different from spreadsheetsHard to find the method to useCryptic manual pages (at times)

It requires a lot of effort to get inside the system, but bear in mind that it is worth it - you gain an edge to all spread-sheet users

Page 7: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Reading material• Keep the reference sheet handy• Save code from lecture exercises in a text file• Powerpoint slides are available after lectures

at the web page• Can also recommend the tutorials on the

CRAN website – this is how I learned R• Dahlgaard book is very good, also for future

reference

Page 8: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Expected:

• You will have R open at all times during this course

• We will make MANY smaller exercises during class, most often in pairs

• Understanding these exercises is the key to succeeding in the exam

• We will have at least two teachers present at al lectures - use us to get help when needed

Page 9: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

•NB: It is YOUR responsibility to that you understand the examples/exercises -if not, you have to ask for clarification during the walkthrough of the problems

Page 10: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

First challenge of the day• Start R

• Type

demo(graphics)

• Hit enter a few times

Page 11: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Starting with R

We will now walk through the most basic R concepts

We will add successive detail on this – with many exercises - and in a while get over to plotting, basic statistics, and finally some programming

Page 12: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenging?

• Those who know R or have a math background might find most things for today no-so-challenging

• We cover this to keep everyone on the same ground

Page 13: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Vectors• In statistics, we are almost always dealing with several “data points”

• A vector is an collection of numbers and/or strings:

– (“albin”, “sanne”, “jens”)

– (1, 5.2, 10, 7, 2, 21)

– (3)

• The last example is a vector of length 1

Page 14: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

In R, we make a vector by the c() command (for concatenate)

> c(1,5,10, 7, 2, 1)[1] 1 5 10 7 2 1

> c('albin', 'sanne', 'jens')[1] "albin" "sanne" "jens"

When we input strings or characters, we have to surround them with “ or ‘

If making vectors of size 1, we can skip c()> 3 [1] 3

Method name Method arguments

Method output is also called “return value(s)”

Page 15: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge:1) Make the following vector45,5,12,10

2) What happens with the following command?c(1:100) c(50:2)

A vector is a data structure, and the most fundamental in R: almost everything in R is some kind of vector, although sometimes in several dimensions – vectors within vectors

Page 16: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The reference sheet is your friend!

• You will get over-whelmed by different command names fast

• Use the reference sheet to remind yourself in all exercises

Page 17: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Assignment to memory• The c() command is almost useless in itself - we want to keep the

vector for other analyses• The assignment concept:

> 4+5 # add 4 and 5[1] 9> a<-4 # store 4 as “a”> b<-5 # store 5 as “b”> a # just checking[1] 4> b[1] 5> a+b # add a+b (4+5)[1] 9 # yes, works

Anything after a # will be disregarded by R. Used for making comments

Page 18: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Expanding this to a whole vectormy_vector<- c(1,5,10, 7, 2)

Note that there is no “return value” now - this is caught by the “my_vector”. What happens if you type my_vector after this?my_vector is a variable, with the variable name my_vector. Variable names are totally arbitrary!The anatomy of the vector:

Name my_vector

Values 1 5 10 7 2

Index [1] [2] [3] [4] [5]

Page 19: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

We can access part of the vector like thismy_vector[5] # will give the 5th item in the vector

What happens if you do this?

my_vector<- c(1,5,10, 7, 2) # define the vector

my_vector [c(1,3,5)]

my_vector[1:4]

my_vector[4:1]

Page 20: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> my_vector[c(1,3,5)][1] 1 10 2> my_vector[1:4][1] 1 5 10 7

Making a series of integers:A:B will give a vector of A,A+1, A+2…B

>my_vector[4:1] # also works backwards[1] 7 10 5 1

Page 21: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge

• Using the reference sheet, figure out at least three ways of making R print your vector in the other direction - my_vector<- c(1,5,10, 7, 2) # define the vector

Page 22: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> my_vector[c(5,4,3,2,1)]

[1] 2 7 10 5 1

> my_vector[c(5:1)]

[1] 2 7 10 5 1

> rev(my_vector)

[1] 2 7 10 5 1

>

Page 23: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Interlude: naming rules and the danger of over-writing

Naming: We can name vectors to almost anything. The most basic rule is : Never start a vector name with a number

> a<- c(1,5,4,2) #OK> 1a<- c(1,5,4,2) # NOT OKError: syntax error> a1<- c(1,5,4,2) # OK

Over-writing:my_vector<- c(1,5,10, 7, 2)my_vector<- c(10,5,2, 3, 1)#what does my_vector contain now?

Page 24: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Analyzing vectors• Many functions work directly on vectors - most have

logical names• For instance, length(my_vector) gives the number of

items in the vector (= 5)• Challenge:

– Make a vector big_vector with values 1 to 10000 using the a:b construct

– What is the • Length of the vector• Sum of all items in vector: the sum() function• Mean(= average) of all items in the vector: the mean() function

Page 25: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> big_vector<-(1:10000)> length(big_vector)[1] 10000> sum(big_vector)[1] 50005000> mean(big_vector)[1] 5000.5

Page 26: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Interlude: the R help system• R has hundreds of functions where most have

a non-trivial syntax: The easiest way of retrieving documentation is by saying

?function_name #ORhelp(function_name) #ORhelp.search(“something”) #for ‘fuzzy’ searches• Look at the help for sample() and sort() and

then try them out on big_vector• Also try out RSiteSearch(“sample”) – what

happens

Page 27: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Adding and multiplying a number to a vector

Sometimes we want to add a number, like 10, to each element in the vector:

> big_vector +10Test this:big_vector2<-big_vector +10Also testmin(big_vector)max(big_vector)min(big_vector2)max(big_vector2)What happens?

Page 28: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> big_vector2<-big_vector +10> min(big_vector)[1] 1> max(big_vector)[1] 10000> min(big_vector2)[1] 11> max(big_vector2)[1] 10010

Page 29: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Adding vectors

• We can also add one vector to another vector• Say that we have the three vectors

A<-c(10, 20, 30, 50)

B<-c(1,4,2,3)

C<-c(2.5, 3.5)• Test what happens and explain the outcome:

A+B

A+C

Page 30: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

A+B is easy to understand : A[1]+B[1] , etc

A+C is trickier - the C vector is just of length 2. It is re-used! So

A[1]+C[1]=12.5

A[2]+C[2]=23.5

A[3]+C[1]=32.5

A[4]+C[2]=53.5

Actually, this is what is happening also with A+10 …the 10 is used many times.

> A+B[1] 11 24 32 53> A+C[1] 12.5 23.5 32.5 53.5

A<-c(10, 20, 30, 50)B<-c(1, 4, 2, 3 )C<-c(2.5,3.5 )

Page 31: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Plotting vectors

• Lets make up some semi-random data:dat<-rnorm (100) #draw 100 random, normal

distributed data points • Test the following:plot(dat)plot(dat,type=“l”)barplot(dat)hist(dat)• What is the difference?

Page 32: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Why are your three first plots different than mine?Why is your last plot more similar to mine?

Page 33: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Graph options• Generally, you can give extra options to graphical commands like

this• plot (some_vector, col=“blue”, type=“l”)• In plot: try to vary the following options - note that you can use

several at once (and figure out what they do)

type=“l”, “b”

col= “hotpink”

main=“important plot”

type=“h”

type=“S”

These options are really arguments to the plot() function

Page 34: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

More about functionsIn almost all cases, a function needs some input, like

plot(dat).

‘dat’ here is an unnamed argument, and this works because plot() assumes we mean x values = dat.

We could also say plot(x=dat) - a named argument. If you have many arguments, most of them are named - such as

plot (some_vector, col=“blue”, type=“s”)

The help pages will tell you what type of arguments you can use

Page 35: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The danger of unnamed arguments…

• …is that the order of them will make a big difference. Try this out - what is the difference between the plot commands?

a<-rnorm(100)b<-rnorm(100)*2plot(a,b)plot(b,a)plot(x=b, y=a)

Page 36: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

b

a

a

a

b

b

plot(a,b)

plot(b,a)

plot(x=b, y=a)

Page 37: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Some generic R arguments to plots - the par() function

The par() function is used to set general plot properties. It has hundreds of possible arguments - see ?par

Two very handy par() arguments is mfrow() and mfcol() - these will allow many plots in one page

You give these functions a vector of length 2 - this gives the number of cells in the page (see example)

Page 38: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Example:> par( mfrow=c(3,1) )> plot (a,b)> plot (b,a)> plot(x=b, y=a)

> par( mfrow=c(2,2) )> plot (a,b)> plot (b,a)> plot(x=b, y=a)

Page 39: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge - can you get the three plots in a row using

mfrow?

Page 40: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> par( mfrow=c(1,3) )> plot (a,b)> plot (b,a)> plot(x=b, y=a)

Page 41: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Overlaying plots

• Sometimes we want to put may data sets within one graph, on top of each other

• This is often made by the lines() or points() command, like this

Page 42: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> plot(b, type="l", col="blue")> lines(a, col="red")

Why did I start with plotting b?What would have happed if usingpoints() instead of lines()?

Page 43: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Sizing graphs

Simple concept, but awkward to write

Change X scale: xlim=c(start_value, end_value)

Change Y scale: ylim=c(start_value, end_value)

Page 44: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> plot(a, type="l", col="blue")> plot(a, type="l", col="blue", ylim=c(-5,5))

Page 45: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Saving graphs

• Different on different systems!

• All systems can use the original tricky way with device() – see introduction PDFs if you have an interest. We won’t use this.

Page 46: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

OSX

• Click on the graph, and just copy it. Will become a pdf or a bitmap when pasting.

• On my mac you have top open a new file in Preview and paste there first. On others you can paste it into other programs directly

• Or, just have the graphics window active and do file->save

Page 47: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Windows

Right-click on graph, copy as metafile or bitmap, paste

Page 48: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Linux

• Use awkward device() system, like pdf(…) or

• Use GIMP and just make a screen capture from there by “acquire”

Page 49: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Statistics part 1

• Describing the data: descriptive statistics

• What is the center point of the data?

• What is the spread of the data?

• How does the distribution look?

Page 50: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Summary statistics

• hist() (= Histogram) is a graphical way of summarizing distributions – it creates a number of “bins” and calculates how many of the data points that fall into each bin.

• We can also summarize by the center points in the data:– mean()

Page 51: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• median():– Sort all the data, pick the number in the center. If the

number of data points is even, take the mean of the two center points

• Challenge: • We make another vector dat2<-rnorm(10)And add a few extra points to itdat2<-c(dat2, 10, 10.5, 30 )• Test mean() and median() on dat2. Are they the same?

Can you explain the differences by plotting a histogram? What is the advantage/disadvantage of each measure?

Page 52: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> dat2<-rnorm(10)> dat2<-c(dat2, 10, 10.5, 30 )> median(dat2)[1] 0.11125> mean(dat2)[1] 3.636139> hist(dat2)

Means are sensitive to outliers! Very common situation in genomics…

Page 53: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Percentiles

• An extension of the median concept• Best explained by example:

– the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

• The median is the same as the 50th percentile

• The first quartile is the 25th percentile, the third is the 75th

• Try summary(dat) and summary(dat2)

Page 54: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

summary(dat) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.20600 -0.60480 -0.06930 -0.01333 0.68030 3.49600 > summary(dat2) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.4010 -0.5385 0.1113 3.6360 0.6771 30.0000

The command ecdf() (empirical cumulative distribution) calculates “all” percentiles in your data – and also understands plot()Try plot (ecdf(dat2))

Page 55: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Sorted values of dat2

What fraction of the data that has been covered at point X

Page 56: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Boxplots• An “easier” representation of ECDFs. Is based

on both making boxes that tell us about both center point and “spread” of the data

• First, calculate the first quartile, the median and the third quartile

• Calculate the “inter-quartile range” (IQR): 3rd quartile -1st quartile

• These will be used to draw a “box”>boxplot(dat)> rug(dat, side=2)

Page 57: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Median

3rd quartile

1st quartile

Interquartile range (IQR)

Page 58: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

…continued

• Sounds more complicated than it is:Any data observation which lies more than 1.5*IQR lower than the first quartile is considered an outlier. Indicate where the smallest value that is not an outlier is by a vertical tic mark or "whisker", and connect the whisker to the box via a horizontal line. Do the same for higher values

Page 59: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Median

3rd quartile

1st quartile

The data point that is highest but within 1.5 IQR from the center

More extreme data - outliers

Page 60: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 61: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Variance, standard deviation and data spread

• What is the difference between these distributions?

Page 62: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Same mean and median, but different spread over the x axis

• This can be measured by the variance of the data:

• It is basically the difference between each point and the mean, squared

Page 63: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Standard deviation is simply variance squared.

• This gives nice statistical features that you will get to know if you study statistics

Page 64: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Why is variance and standard deviation important?

• Variance tells you something about the quality of measurements

• The higher the variance, the harder it is to with certainty say that two measurements are different

• (whiteboard demo)

Page 65: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• What are the R functions for variance and standard deviation?

• If we make some random datasmallset<-rnorm(100)

largeset<-rnorm(10000)

What is the variance and standard deviation for these?

Is the standard deviation really the square root of the variance (what is the function for square root?)

Page 66: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

>smallset<-rnorm(100)

>largeset<-rnorm(10000)

>var(smallset)

[1] 0.9504884

>var(largeset)

[1] 0.98416

>sd(largeset)

[1] 0.9920484

>sd(smallset)

[1] 0.97493

>sqrt(var(smallset))

[1] 0.97493

Why do we get about the same variance?

Page 67: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 68: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Dealing with real data

• …involves interacting with messy data from other people

• Almost all interesting data is “real”

• We need methods to get the data into R to start with.

Page 69: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Reading data from files

• Typically, you have data in a .txt or .xls file - how do you get it into R for analysis?

• Four steps:– 1 Format it to tab-separated text file– 2 Tell R what directory to work in– 3 Read in the file– 4 Attach data to memory (optional)

Page 70: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

What R likes:• A “data frame”(=table) with columns.

The first row can be column names, like this:

Lecturer height age Poker_games_won

Albin 181 34 14

Jens 189 31 10

Sanne 170 NA 21

Name row, “header”

String data

First data row

Numerical dataCode for missing dataDo not ask ladies for age…

Page 71: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Reading data(1)

• General rules: – Use tabs as separators– Each row has to have the same number of

columns– Missing data is NA, not empty

• As .txt: The ideal format!

• As .xls: save as tabbed text

Page 72: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 73: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Get some data to play with:Update: old file is wrong!

Please download the data again• See course web page!

• http://people.binf.ku.dk/albin/teaching/htbinf/R/• Download the zipped file – save this at a good place that

you know where it is• I will refer to this place on your hard drive as “the data

directory”

Page 74: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Reading data(2)

• Tell R where to work.– In R, you always work in a given directory, called

“working directory”.

– You can set this graphically in Windows or MacOS (next pages)

– Or use setwd(“some directory”)

– In Linux, it pays to start R in the same directory

• It pays to have one directory per larger analysis, and call the data files informative things (not “Data.txt”)

Page 75: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

In OSXCurrent working directory Change working directory

Page 76: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

In Windows

Page 77: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Reading in data (3)

my_data<- read.table(“filename”, header=T)

Only if you have column names – I always have that

Data frame object holding all the data

The actual command

Page 78: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Lets try…

• In the data directory, there is a file “flowers.txt”. Have a look at it (it is a normal text file – can be opened in almost anything). Now:

my_data<-read.table(“flowers.txt”, header=T)

names(my_data)

What happens – what does “names” do?

Page 79: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> my_data<-read.table("flowers.txt", h=T) # note to self: show name completion feature> names(my_data)[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

Sepal.LengthSepal.Width Petal.Length Petal.Width Species5.1 3.5 1.4 0.2 setosa4.9 3.0 1.4 0.2 setosa4.7 3.2 1.3 0.2 setosa4.6 3.1 1.5 0.2 setosa5.0 3.6 1.4 0.2 setosaEtc…

Page 80: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Accessing the columns

• There are three ways of getting the data in a data frame– Get the column by saying

dataframe_name$column_name– Get the columns as separate vectors, named

as their header - the attach command– More complex queries, using index

operations (more in a few slides)

Page 81: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Columns by names

• The attach(some_data_frame) function will put all the columns of some_data_frame as vectors in working memory

• So, in our case• attach(my_data)

• Will give the 5 vectors • "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" and "Species"

• (which we then can analyze)

Page 82: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Dangers of the attach() function

• Attach will make a vector out each column, named after the column name

• What happens if we already have a vector with that name?

• What does ls() and detach() do? Why would we want to do this?

Page 83: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge

• Using the flowers file, – Plot a histogram of the length of sepals (“Sepal.Length”)– Plot the correlation between sepal length and sepal

width (an x-y plot)

• What is the mean of Sepal.Length and Sepal.Width?– Why do you get strange results on Sepal.Length?– Can you solve the problem by using the appropriate

help file?

Page 84: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> hist(Sepal.Length)> plot(Sepal.Length, Sepal.Width)

Page 85: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> mean(Sepal.Width)[1] 3.057333> mean(Sepal.Length)[1] NA>

# strange!

If we look at the file:Sepal.LengthSepal.Width Petal.Length Petal.Width Species5.1 3.5 1.4 0.2 setosa4.9 3.0 1.4 0.2 setosaNA 3.2 1.3 0.2 setosa4.6 3.1 1.5 0.2 setosa5.0 3.6 1.4 0.2 setosa5.4 3.9 1.7 0.4 setosa4.6 3.4 1.4 0.3 setosa

Page 86: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Missing values

• In real data, we often have missing values. In R, we label these “NA”

• Any examples of missing data?

• Some functions can by default deal with these - for instance hist()

• Some functions will need to be told what to do with them, like mean()

Page 87: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> mean(Sepal.Length)[1] NA#na.rm is the option for mean to remove all NAs first > mean(Sepal.Length, na.rm=T)[1] 5.851007

With new files, you should check whether there are any NA values first.

Page 88: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

What if we wanted just the 5 first rows? Or just all values from the

species versicolor?Sepal.Length Sepal.Width Petal.Length Petal.Width Species5.1 3.5 1.4 0.2 setosa4.9 3.0 1.4 0.2 setosa4.7 3.2 1.3 0.2 setosa4.6 3.1 1.5 0.2 setosa5.0 3.6 1.4 0.2 setosa6.2 2.9 4.3 1.3 versicolor5.1 2.5 3.0 1.1 versicolor5.7 2.8 4.1 1.3 versicolor6.3 3.3 6.0 2.5 virginica5.8 2.7 5.1 1.9 virginica7.1 3.0 5.9 2.1 virginica6.3 2.9 5.6 1.8 virginica

Page 89: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The index concept• Recap: • In a vector, we can select individual values by saying• Sepal.Width[some position]> Sepal.Width[5][1] 3.6 • Sepal.Width[some_positions]> Sepal.Width[2:10][1] 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1Sepal.Width[c(2,4,6,8)][1] 3.0 3.1 3.9 3.4

• The first position will always have the number 1, the last will have the length of the vector.

• Challenge: what is the mean of the first 100 values of Sepal.Width?

Page 90: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s code

• > a<-Sepal.Width[1:100]

• > mean(a)

• [1] 3.099

• # or, in just one line

• > mean(Sepal.Width[1:100])

• [1] 3.099

Page 91: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Selecting parts of vectors depending on values

• Often, we want parts of vector due to the values of the vector -for instance all values <10, or all strings equaling “versicolor”.

• This is made by a slightly odd syntax(it will make sense later on when we have many vectors)

• To get all values over 10 in Sepal.Length

Sepal.Length[ Sepal.Length >10 ]• So, we are using a “criterion” instead of the index

Page 92: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Try out

• Make a boxplot of all values in Sepal.Length that are higher than 5

• How many vales are there that satisfy this criterion?

Page 93: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s code

a<-Sepal.Length[Sepal.Length >5 ]boxplot (a)length(a)

[1] 119

Page 94: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

criteria in R• < Smaller than• > Bigger than• >= Bigger or equal to• <= smaller than or equal to• == equal to (not “=“)• != not equal to

• Challenge: how many rows for the virginica species?

Page 95: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> length(Species[Species==virginica] )Error in NextMethod("[") : object "virginica" not found

# what happens here is that R thinks virginica is an object - a vector or something else

> length(Species[Species=="virginica"] )[1] 50

For string comparisons, we need to enclose the string with “ ” or ‘’

Page 96: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Selection using two vectors

• Often we have two vectors from a file, and we want to use the value of one vector to select values from the other.

• For instance, sepal length broken up by species

Sepal.Length Sepal.Width Petal.Length Petal.Width Species5.1 3.5 1.4 0.2 setosa4.9 3.0 1.4 0.2 setosa4.7 3.2 1.3 0.2 setosa4.6 3.1 1.5 0.2 setosa5.0 3.6 1.4 0.2 setosa6.2 2.9 4.3 1.3 versicolor5.1 2.5 3.0 1.1 versicolor5.7 2.8 4.1 1.3 versicolor6.3 3.3 6.0 2.5 virginica5.8 2.7 5.1 1.9 virginica7.1 3.0 5.9 2.1 virginica6.3 2.9 5.6 1.8 virginica

Page 97: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Use the same syntax as before - but change the vetor names!

• some_vector[ other_vector == something]• Challenge: • select the Sepal.Width values that

correspond to the setosa species• What is the mean of these values?• What is the distribution of these values?

Page 98: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s code> a<-Sepal.Width[Species=="setosa"]> mean(a)[1] 3.428> hist(a)

Page 99: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Fancy way of breaking up data (only works with some

functions)• boxplot (Sepal.Width~Species)

• Try it out…

• This is a “formula” construction - break up width in the levels of species

Page 100: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Even more complex: multiple criteria

• & “and”: combine two criteria• | “or”: either this or that has to be trueSo, Species[ Sepal.Width >4 & Sepal.Length <5] will give the species rows where width >4 AND Length is <5.

…and, Species[ Sepal.Width >4 | Sepal.Length <5] will give?

Challenge: Make a boxplot of Sepal.Width where Sepal.Length >3 AND the species is virginica

Page 101: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

>boxplot (Sepal.Width[Sepal.Length >3 & Species=="virginica"] )

Page 102: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

More complex data selections

• The dataframe is really a two-dimensional vector - a vector that contains vectors

Page 103: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• We can use almost the same commands to select sub-sets of the data frame, without using the attach() function

• This is very handy at times

Page 104: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• The data frame is really a big table - a 2D vector.• Just as some_vector[5] gives use the fifth value in a

vector, • some_dataframe[10, 2] gives us the cell in the 10th

row and the 2nd column

>my_data[10, 2]

[1] 3.1

Sepal.Length Sepal.Width Petal.Length Petal.WidthSpecies5.1 3.5 1.4 0.2 setosa4.9 3.0 1.4 0.2 setosa4.7 3.2 1.3 0.2 setosa4.6 3.1 1.5 0.2 setosa5.0 3.6 1.4 0.2 setosa5.4 3.9 1.7 0.4 setosa4.6 3.4 1.4 0.3 setosa5.0 3.4 1.5 0.2 setosa4.4 2.9 1.4 0.2 setosa4.9 3.1 1.5 0.1 setosa5.4 3.7 1.5 0.2 setosa4.8 3.4 1.6 0.2 setosa4.8 3.0 1.4 0.1 setosa

Page 105: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Selecting rows• If we instead say• my_data[10, ]• We will get the WHOLE 10th row back• One can think of it as saying

my_data[10, “give me everything”]

• So, what will happen if we say

my_data[1:5, ]• Or

my_data[5:1, ]• Try it out…

Page 106: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Indexing exercise

• In the data directory, there is a file called Rindexing.pdf – it is a set of small exercises designed to help you understand indexing

• It works with a very small data file, just so that you can check your results by eye

• Indexing is very important for the rest of the course

Page 107: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Nota Bene: 1)Students on the reserve list

• Please talk to me to get signed on to the course (I need your student number or similar)

• Otherwise you cannot go to the exam => no credit

• 2) A watch was found in the last lecture – will be given back if you can describe it well

Page 108: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Sorting one vector by another

• Very common situation with experimental data – what are the genes with the highest expression in my data set?

• Recap: what happened if you said

my_data[1:5, ]• Or

my_data[5:1, ]• ?

Page 109: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

# get some data to play with

d<-read.table("mtcar.txt", h=T)

attach(d)

Just by looking at the data,

d;

what car has the highest and lowest horsepower (=hp)?

Page 110: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• car_name mpg cyldisp hp drat wt qsecvs am gear carb• 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4• 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4• 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1• 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1• 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2• 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1• 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4• 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2• 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2• 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4• 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4• 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3• 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3• 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3• 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4• 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4• 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4• 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1• 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2• 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1• 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1• 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2• 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2• 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4• 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2• 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1• 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2• 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2• 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4• 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6• 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8• 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Page 111: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Permutations

sort.list(x) will give the permutation of x that sorts x

Is much easier to explain with a small example…

> a<-c(3,1,10)

> sort.list(a)

[1] 2 1 3

Page 112: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> a<-c(3,1,10)

> sort.list(a)

[1] 2 1 3

> b<-sort.list(a)

> a[b]

[1] 1 3 10

sort.list(hp)

[1] 19 8 20 18 26 27 3 9 21 6 32 1 2 4 28 10 11 22 23 5 25 30 12 13 14 15 16 17 7 24 29 31

Page 113: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• car_name mpg cyl disp hp drat wt qsec vs am gear carb• 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4• 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4• 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1• 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1• 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2• 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1• 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4• 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2• 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2• 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4• 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4• 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3• 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3• 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3• 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4• 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4• 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4• 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1• 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2• 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1• 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1• 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2• 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2• 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4• 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2• 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1• 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2• 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2• 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4• 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6• 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8• 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

19 8 20 18 26 27 3 9 21 6 32 1 2 4 28 10 11 22 23 5 25 30 12 13 14 15 16 17 7 24 29 31

Page 114: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Thought exercise

What happens if we now say

hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)]

?

Page 115: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

We get hp sorted…

hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)]

[1] 52 62 65 66 66 91 93 95 97 105 109 110 110 110 113 123 123 150 150 175 175 175 180 180 180 205 215 230 245 245 264 335

Page 116: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

So, sort.list really gives an instruction how to sort

So, we can also say

sort_order<-sort.list(hp)

hp[sort_order]

What will happen if we say

car_name[sort_order]

?

Page 117: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

car_name[sort_order]

[1] Honda Civic Merc 240D Toyota Corolla Fiat 128 Fiat X1-9 Porsche 914-2 Datsun 710

[8] Merc 230 Toyota Corona Valiant Volvo 142E Mazda RX4 Mazda RX4 Wag Hornet 4 Drive

[15] Lotus EuropaMerc 280 Merc 280C Dodge Challenger AMC Javelin Hornet Sportabout Pontiac Firebird

[22] Ferrari Dino Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial

[29] Duster 360 Camaro Z28 Ford Pantera L Maserati Bora

32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 Chrysler Imperial Datsun 710 Dodge Challenger Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E

So, we use the hp permutation to reorder the cars!

We can also do this with the whole dataframe, in the same way

>d[sort_order,]

Page 118: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> d[sort_order,]

car_name mpg cyl disp hp drat wt qsec vs am gear carb

19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1

27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4

22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2

23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3

13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3

14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3

15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4

16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4

29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4

31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

Page 119: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

What type of data are we looking at?

• Type here refers to what kind of data representation - for example “strings” or “numbers”

• R usually does a good job at interpreting this for us when we read data

• This becomes important later on, as some functions only take vectors of a certain data type - for instance, you cannot take the mean() of a vector with “words”!

Page 120: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Simple types

• Numbers:num_vector<- c(1,5,2,10)

• Strings(or characters)String_vector<-c( “a”, “trial1”)

• How can we know?

class(num_vector)

[1] "numeric"

Page 121: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Factors/groups• Factors: slightly more tricky. A factor is often a

grouping variable - such as the gender or experiment group, like

gender<-c(‘f’, ‘f’, ‘m’, ‘m’, ‘f’)experiment_number<-c(1,2,4,5,5,6)• Problem here is that R will think that the first vector is

a string vector, and the last a numerical vector. > class(gender)[1] "character"> class(experiment_number)[1] "numeric"

Page 122: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• This is resolved by

• > experiment_number<-as.factor( c(1,2,4,5,5,6))

• > class(experiment_number)

• [1] "factor"

Page 123: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Over to analysis of data!

boxplot (Sepal.Width~Species)

Is there a real difference?

Page 124: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

We now move over from “pure R” to “using R to make

statistical tests” • Focus is: what typical tests are there

and how to use them

• We skip almost all statistical theory(!) - covered in the course in fall

Page 125: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Population and sample• Population refers to a set of potential measurements

or values, including not only cases actually observed but those that are potentially observable. Really depends on the question at hand!

• Sample is the data we actually have - generally this is very small compared to the whole population, but we use it to draw inferences about the population

• So, the sample is a small part of a much larger population, in most cases

• Can sample==population? Examples?

Page 126: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Significance tests• In real life, we always get some difference

between datasets, because we sample limited data, and there is always noise

• So, is the difference in sepal width we see between species something that is just due to “luck” - or is it real?

• Challenge: What three features in the dataset will be important for finding this out (significant difference or just difference due to random noise) ? Discuss with your sideman for a minute

Page 127: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s take

• The observed difference– Obviously, the bigger observed difference, the less

random it is

• The data depth– The more data you have, the less likely it is that

strange things happens due to random effects

• Data spread (variance)– Slightly less intuitive. The less spread, the less it

can be explained by random events.

Page 128: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The concept of null hypothesis and P-values

• Always in statistics, we set a null-hypothesis. This is basically what we want to disprove!

• Usually, this is something like “there is no difference in mean in these two samples”. The alternative hypothesis is then, by default, “There is a difference”

• Then, we test how likely this is, with certain assumptions (depending on what test)

Page 129: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• The P-value of a test can be interpreted as the chance that the observed difference occurs if the null hypothesis is true– This only makes sense if your test is relevant!

• So, a small P-value means that the null hypothesis is not a good model - which means that there is a difference

• We often want P-values to be as small as possible - the standard cutoff for “significance” is 5%, or 0.05

• However, this is just tradition – it is completely arbitrary

Page 130: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Two-sample t-test

• [deliberate skip of math details – Jens will cover some later on]

• The t-test tests if two distributions come from distributions with the same mean.

• The assumption is that both distributions are “normal-shaped” - we should always check this first

Page 131: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Mean

Standard deviation – a measure of spread

Page 132: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Normality - brief check• There are many tests for normality. Now

we will do the simplest: by eye. Make a histogram and see if the data looks reasonably “bell-shaped”

• Challenge: Test to make histograms for the sepal length of each species

Page 133: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s take

#warning: wrong approach

hist(Sepal.Length)

# not very normal-shaped!

#why is this the wrong approach?

Page 134: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s take, again#now for each species

par (mfrow=c(1,3))

hist(Sepal.Length[Species=="setosa"])

hist(Sepal.Length[Species=="versicolor"])

hist(Sepal.Length[Species=="virginica"])

#IMHO, these are not very far away from normal-shaped given the amount of data

Page 135: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Slightly more advanced way – the quartile-quartile plot

• If two distributions have the same shape, their cumulative distributions should match well

4 5 6 7 8

0.0

0.2

0.4

0.6

0.8

1.0

ecdf(Sepal.Length)

x

Fn

(x)

-2 -1 0 1 2

0.0

0.2

0.4

0.6

0.8

1.0

ecdf(rnorm(50))

x

Fn

(x)

Page 136: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

-2 -1 0 1 2

4.5

5.0

5.5

6.0

6.5

7.0

7.5

8.0

Normal Q-Q Plot

Theoretical Quantiles

Sa

mp

le Q

ua

ntil

es

qqnorm(Sepal.Length)

Page 137: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

-2 -1 0 1 2

4.5

5.0

5.5

Normal Q-Q Plot

Theoretical Quantiles

Sa

mp

le Q

ua

ntil

es

qqnorm(Sepal.Length[Species=="setosa"])

Page 138: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

So, are they different?> setosa<-Sepal.Length[Species=="setosa"]> versicolor<-Sepal.Length[Species=="versicolor"]

> t.test(setosa, versicolor)

Welch Two Sample t-test

data: setosa and versicolor t = -10.521, df = 86.538, p-value < 2.2e-16 #Very!alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.1057074 -0.7542926 sample estimates:mean of x mean of y 5.006 5.936

Page 139: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Lets test it with some real data

In your data directory, there are two files with binding sites lengths, from estrogen receptor alfa and beta: ERA_widths.txt, ERB_widths.txt

1) Read these into two different vectors, called era, erb

2) Have they significantly different means?

Page 140: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albins code> d<-read.table("ERA_widths.txt", h=T)> attach(d)> d<-read.table("ERB_widths.txt", h=T)> attach(d)>

> t.test(ERA_width, ERB_width)

Welch Two Sample t-test

data: ERA_width and ERB_width t = 2.7288, df = 1066.975, p-value = 0.00646alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 11.18740 68.45565 sample estimates:mean of x mean of y 514.1370 474.3155

Page 141: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Hey, we forgot to check for normality!

> par(mfrow=c(1,2))> hist (ERA_width)> hist (ERB_width)# NOT really normal shaped! t.test is not valid!

Page 142: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The two-sample Wilcoxon test

• The t-test is assuming that samples are normal-distributed, which is often not true with biological data

• The wilcoxon test : wilcox.test() is a less powerful test which does not make such assumptions.

• Tests of this kind are called non-parametric• Use when the normality assumption is

doubtful

Page 143: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge

• Repeat the analysis of the ERA/beta sites using wilcox.test. Is the difference significant?

Page 144: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> wilcox.test(ERA_width, ERB_width)

Wilcoxon rank sum test with continuity correction

data: ERA_width and ERB_width W = 151243.5, p-value = 0.05551alternative hypothesis: true location shift is not equal to 0

The “AHH, so close to significance” situation

Page 145: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Different types of alternative hypothesis

Null hypothesis:“there is no difference”

Default alternative hypothesis: “there is difference”. This is called “two-sided”, as the differences can be in either direction

Optional: We can make the alternative hypothesis “one-sided” - like “There is a difference - distribution A has a greater mean than distribution B”. This increases the statistical power *2

Page 146: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

In R> wilcox.test(ERA_width, ERB_width)

Wilcoxon rank sum test with continuity correction

data: ERA_width and ERB_width W = 151243.5, p-value = 0.05551alternative hypothesis: true location shift is not equal to 0

> wilcox.test(ERA_width, ERB_width, alternative="greater")

Wilcoxon rank sum test with continuity correction

data: ERA_width and ERB_width W = 151243.5, p-value = 0.02776alternative hypothesis: true location shift is greater than 0

# what happens if alternative=“less“?

Page 147: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> wilcox.test(ERA_width, ERB_width, alternative="less")

Wilcoxon rank sum test with continuity correction

data: ERA_width and ERB_width W = 151243.5, p-value = 0.9723alternative hypothesis: true location shift is less than 0

Page 148: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Dangers with “hypothesis fiddling”

• This should never be used to get “twice as good P-value” without some justification

• Homework for Tuesday: read BMJ 1994;309:248 (linked at the web page) for discussion on when this is appropriate

Page 149: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Testing many samples at once

• Sometimes, we have a table of groups, and we want to see if any of the groups are different in means

treatment1 treatment2 treatment3plant1plant2plant3etc

• This is the typical one-way Anova test situation

Page 150: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

boxplot (Sepal.Width~Species)

Is there a real difference?

Last time we checked this, we just tested setosa vs versicolor

Page 151: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

In R, oneway anova tests can be made easy by the oneway.test() command

The oneway.test() works by giving it a formula, in our case

Sepal.Width~Species. This will basically give the function a “table” with

three columns of values, one for each species. It will then test if the means of any of these columns

are different from the others

Page 152: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Test flowers set> d<-read.table("flowers.txt", h=T)> attach(d)oneway.test(Sepal.Width~Species)

One-way analysis of means (not assuming equal variances)

data: Sepal.Width and Species F = 45.012, num df = 2.000, denom df = 97.402, p-value = 1.433e-14

As expected, very, very significant.Is the difference in length of sepals (not widths) even more significant? Test it!

Page 153: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Non-parametric alternative to Anova

As with t-tests, Anova tests assume that each of the columns has a normal-like distribution. This is often not the case

The Kruskal-Wallis test is analogous to the wilcoxon test - a non-paramatric variant. Practically, it works exactly in the same way as the oneway.test().

Page 154: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

2 data sets >2 data sets

Normal distribution

t.test() oneway.test()

Not normal distribution (non-parametric tests)

wilcox.test() kruskal.test()

Page 155: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Larger exercise

In the data folder, there is file gene_lengths.txt, showing gene lengths from three different chromosomes.

Is there a significant difference in the center points (means or medians) of gene lengths for the different chromosomes? What is the method of choice? What two chromosomes have the most/least significant difference?

Page 156: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> d<-read.table("gene_lengths.txt", h=T)> attach(d)> names(d)[1] "id" "chr" "gene_length”#first - is the gene length distribution normal-shaped?> hist(gene_length)# nope!# => nonparametric tests # needed

Page 157: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

# many more than two groups, # and non-parametric test needed =>kruskal wallis test> kruskal.test(gene_length~chr)

Kruskal-Wallis rank sum test

data: gene_length by chr Kruskal-Wallis chi-squared = 131.7681, df = 2, p-value < 2.2e-16# very significant difference!

Page 158: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

#are all chromosome pairs significant?# test case: use wilcox.test on two chromosomes> wilcox.test(gene_length[chr=="chr2"] , gene_length[chr=="chr3"])

Wilcoxon rank sum test with continuity correction

data: gene_length[chr == "chr2"] and gene_length[chr == "chr3"] W = 1068584, p-value = 0.1822alternative hypothesis: true location shift is not equal to 0 # so, all pairs are not significantly different!

Page 159: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

chr1 chr2 chr3

chr1 1 2.2E-16 2.2E-16

chr2 1 0.1822

chr3 1

Page 160: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

#for us who are lazy - we can do all vs all tests like this:

> pairwise.wilcox.test(gene_length, g=chr )

Pairwise comparisons using Wilcoxon rank sum test

data: gene_length and chr

chr1 chr2chr2 <2e-16 - chr3 <2e-16 0.18

P value adjustment method: holm

Page 161: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Small homework I: Reading about one-and two-sided testsWhat is horribly wrong in this (published) study?

Page 162: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Small homework (II)

What is the statistical problem with making these pairwise tests en masse, for instance making 1000 t-tests and reporting all cases where P<0.05?

Hint: Go back to the slides about what the 5% P value cutoff really means

Page 163: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Homework and ExaminationBig homework for examination goes online this evening!

Page 164: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Examination• Three home works during the course

– These will together give 50% of the final points– Main focus here is technical skill (how to do analysis x)– Theoretically optional. Very strict deadlines.

• One large home work at the end of the course– Will give 50% of final points– Overlaps substantially with the three first, but is more high-level– Main focus here is to merge technical skill with biologically

motivated analysis - real science• Do the math…the three smaller home works pays off

• Note: High level of activity needed from the start of the course. Will be demanding! The home works are the examination!

Page 165: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• The points with homework– Encourage study during the course

– It is easy to think that you understood something, but when meeting the real thing…

– Only realistic way of training– Evaluation

Page 166: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Rules for homework

• Working together in 2-3 perons teams is allowed and encouraged, BUT– Each person hands in a separate report– …which must be unique. This is usually not a

problem if you write it yourself…– If you work with others, write who you work with on

the report (“ I worked with X and Y in this homework!)

• Cases of copying between students WILL be reported. The BEST thing that can happen is that you will get lower scores

Page 167: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 168: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 169: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Using material from others

• Copying text/images from other sources – including fellow students – is considered cheating

• You can quote sources, given that you cite them, but this has a limit

Page 170: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Questions?

Page 171: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Correlations• Sometimes, we do not want to show that two

things are different, but that they are correlated.• For instance, weight and height for individuals

are clearly different, but also highly correlated• This is a positive correlation - but we could also

have a negative correlation• Correlations are dangerous things - it depends

on how we ask questions .Easily the most misused and misunderstood part of statistics

• Correlation != Causation

Page 172: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Consider : Why is it that…

Truth: Compared to the general population, there is a strong correlation between being an American football player and risk for testicle cancer

Interpretation: American football leads to cancer!

Page 173: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Consider : Why is it that…

Truth: There is an inverse correlation between yearly income and running speed

Interpretation: carrying money makes you slow!

Page 174: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Time to exploreOpen the height.txt file in R, and look at

the different data types. Try to plot various columns against each other, like

plot (mean_height, Year)

What columns correlate? Can we break up the data even more?

Try it out!

Page 175: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

>plot (mean_weight, year )> plot (mean_height, year )> plot (mean_height, mean_weight )

Hmmm…obviously there are TWO datasets here! How can we break this up?

Page 176: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Visualize the groups with colorplot (mean_weight, year )lines (mean_weight[gender=="F"], year[gender=="F"], col="red" )lines (mean_weight[gender=="M"], year[gender=="M"], col="blue" )# a typical case of OVER-plotting

Page 177: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Assessing correlations

• First: plot a standard x-y plot

• Second, make a correlation test:

• cor(x, y) gives the Pearson correlation coefficient between z an y, which goes from -1(perfect negative correlation) to +1(prefect correlation). 0=no correlation

Page 178: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Test it…

> cor(mean_weight, year)

[1] 0.1688499

#BAD correlation - close to zero

What happens if we break it up by gender as in the last plot? Test it!

Page 179: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> cor (mean_weight[gender=="M"], year[gender=="M"] )[1] 0.9527568> cor (mean_weight[gender=="F"], year[gender=="F"] )[1] 0.931249

# very, very correlated!

Page 180: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Important: Pearson correlation gives a “score”, not a significance value. Obviously, it is easier to get a good correlation if you only have a few values!

Often, we want to also want to TEST if the correlation is there. The null hypothesis is then 0 correlationThis is done with thecor.test()command

Page 181: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> cor.test (mean_weight, year )

Pearson's product-moment correlation

data: mean_weight and year t = 0.5934, df = 12, p-value = 0.5639alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.3973253 0.6419208 sample estimates: cor 0.1688499

Page 182: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> cor.test (mean_weight[gender=="F"], year[gender=="F"] )

Pearson's product-moment correlation

data: mean_weight[gender == "F"] and year[gender == "F"] t = 5.7147, df = 5, p-value = 0.002293alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.5965244 0.9900208 sample estimates: cor 0.931249

Page 183: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Note that cor and cor.test are not the same thing

• cor gives a co-efficient – how well is the data described by a linear correlation

• cor.test tests if that co-efficient is 0 or not. If you have enough data, you don’t need a nice cor value to get a small p-value. So cor.test doe not test if the correlation is linear or not

Page 184: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Making a simple linear model fit

• If there is a correlation, we often want to fit a simple model to it – the classical Y=kx+m

• R does this by the lm() function, which also can do a lot more complex things

Page 185: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Here, we need to use the formula construct that we used before, solm (y~x) # make a model where x describes the outcome yThis will return a lm “object”, which holds all kinds of information - the model, how well it fitted the data, etc

Page 186: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Some caveats

Here, we deal with linear regression - one of the simplest usages of lm()

lm() can do a lot of other funky things that are “advanced manual”

Page 187: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Some necessary math

In linear regression, we are fitting the response (y) to the data(x) by a linear function

Y=kX+m

Where k is the “slope”, and m the intercept (high school math!)

Page 188: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

How to make a linear regression between mean weight and years, only females

lm( year[gender=="F"] ~mean_weight[gender=="F"] )Call:lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Coefficients: (Intercept) mean_weight[gender == "F"] 1616.551 6.297

Y=kX+m

Page 189: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Plotting a fit

# first, plot the actual data: plot (mean_weight[gender=="F"] , year[gender=="F"] , col=“red”)# x y# then make the fit - note the switch of year and mean_weight! my_model <-lm ( year[gender=="F"] ~ mean_weight[gender=="F"] )# y xabline(my_model, col=‘red’)

Page 190: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge…

We made this plot previously - weight vs year (females in red, males in blue)Now, make one linear fit for each gender and display them in the same figure (example on next page)

Page 191: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of
Page 192: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

attach(d)> plot (mean_weight, year )> lines (mean_weight[gender=="F"], year[gender=="F"], col="red" ) # will over-write plot> lines (mean_weight[gender=="M"], year[gender=="M"], col="blue" ) # will over-write plot> my_model <-lm ( year[gender=="F"] ~ mean_weight[gender=="F"] )> females_model <-lm ( year[gender=="F"] ~ mean_weight[gender=="F"] )> males_model <-lm ( year[gender=="M"] ~ mean_weight[gender=="M"] )> abline(females_model, col="orange" )> abline(males_model, col="cyan" )

Page 193: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

How good is a fit?

A whole science in itself!Here, we will focus on R2:R=Pearson Correlation coefficient from before

(-1 to +1)R2 is just R squared (so, from 0 to 1).We can access these, and a LOT of other

things, by saying summary(my_fitting)

Page 194: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

A horrible amount of data…Call:lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Residuals: 1 2 3 4 5 6 7 -2.3856 2.4661 -4.0161 -0.7568 -1.2754 0.7246 5.2432

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1616.551 66.311 24.378 2.17e-06 ***mean_weight[gender == "F"] 6.297 1.102 5.715 0.00229 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.409 on 5 degrees of freedomMultiple R-Squared: 0.8672, Adjusted R-squared: 0.8407 F-statistic: 32.66 on 1 and 5 DF, p-value: 0.002293

Page 195: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Count statistics

• Often, we have ONLY categorical data, such as “number of genes of category Y”

• Typically, these situations is where you have table where all the cells adds up to all cases or experiments

• This is called a contingency table, or a matrix

Page 196: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Matrices• A matrix is almost the same as a data frame -

a table• Only difference is that all data are only ONE

DATATYPE - only numbers, only characters or only logicals

• Here, we will mostly deal with numbers• How to create a matrix from some values:

– matrix(c(2,10, 5, 8,7,4), nrow=2)

Number of rows

Page 197: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

matrix(c(2,10, 5, 8,7,4), nrow=2) [,1] [,2] [,3][1,] 2 5 7[2,] 10 8 4

By default, fills in the matrix column by column…

> matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T) [,1] [,2] [,3][1,] 2 10 5[2,] 8 7 4

But can do it by row also!

Page 198: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Want an extra row or column?

• The cbind() command adds an extra column to a matrix. Very handy at times! It can be understood as “bind a column to my matrix”. So, if we want to add another column to our matrix:

> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T) # just as before> m [,1] [,2] [,3][1,] 2 10 5[2,] 8 7 4> m2<-cbind(m, c(9,7))> m2 [,1] [,2] [,3] [,4][1,] 2 10 5 9[2,] 8 7 4 7

Page 199: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge• 1• Figure out what the rbind() command does compared to cbind()• Using these two commands, go from this matrix[,1] [,2] [,3] [1,] 2 10 5 • [2,] 8 7 4

• To this [,1] [,2] [,3] [,4][1,] 2 10 5 9[2,] 8 7 4 7[3,] 2 1 4 5

•2•If your final matrix is called m3, test the following:•rbind(m3, 1);•rbind(m3, 1:3)•What is happening and why?

Page 200: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Albin’s try

• m2<-cbind(m, c(9,7)) # add the column=a vector of size 2

• > m2

• [,1] [,2] [,3] [,4]

• [1,] 2 10 5 9

• [2,] 8 7 4 7

• > m3<-rbind(m2, c(2,1,4,5)) # add a row - this is now of size 4, since we added a column above

• > m3

• [,1] [,2] [,3] [,4]

• [1,] 2 10 5 9

• [2,] 8 7 4 7

• [3,] 2 1 4 5

Page 201: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Cont…

> rbind(m3, 1); [,1] [,2] [,3] [,4][1,] 2 10 5 9[2,] 8 7 4 7[3,] 2 1 4 5[4,] 1 1 1 1> rbind(m3, 1:3) [,1] [,2] [,3] [,4][1,] 2 10 5 9[2,] 8 7 4 7[3,] 2 1 4 5[4,] 1 2 3 1Warning message:number of columns of result

is not a multiple of vector length (arg 2) in: rbind(1, m3, 1:3)

Dangerous feature of many R commands - the vector you supply is reused if possible!! In example 1, it is used times. In example 2 is is used 1.33 times (hence the warning)

Page 202: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Indexes in matrices

• Works exactly as in data frames!

• m2[x, y] gives the cell in the x row and y column

• m2[, y] gives the whole column y

• Etc…

Page 203: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

We can also add annotation to the matrix…

Often useful to give rows and columns names, similarly to a dataframe

> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T, dimnames = list(c("first_player", "second_player"), c("score1", "score2", "score3")))> m score1 score2 score3first_player 2 10 5second_player 8 7 4

The list() construct we have not used so far -view it just as syntax at this stageThis matrix is called a 3x2 contingency table

Page 204: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

2x2 contingency table - a common situation in biology

• Say that we have two types of genes, and two types of annotation features, for instance “related to cancer”. Is one of the groups more cancer-related?

Cancer-related Not cancer-related

Genes of type A 40 60

Genes of type B 6 4

A (40%) vs B(60%)…but the B group is very small. Are the proportions significantly different - could this arise just by chance?

Page 205: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Important!

• Contingency tables must have MUTUALLY EXCLUSIVE cells!

• As in all other test, we are assuming that samples are independently drawn!

Page 206: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Fishers Exact Test• Tests “proportions” in a 2x2 table by permutations

• Actually, the null hypothesis is “no relation between

rows and columns”my_matrix<-matrix(c(40, 60, 6, 4), nrow=2, byrow=T)> fisher.test(my_matrix)

Fisher's Exact Test for Count Data

data: my_matrix p-value = 0.3152alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.08719973 2.02538340 sample estimates:odds ratio

0.4478183

NOT significant…Maybe the size of the B set is too small?

Page 207: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Test the following tables

Cancer-related Not cancer-related

Genes of type A 40 60

Genes of type B 60 40

Cancer-related Not cancer-related

Genes of type A 5100 4900

Genes of type B 4900 5900

Page 208: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Test the following tables

Cancer-related Not cancer-related

Genes of type A 40 60

Genes of type B 60 40

Cancer-related Not cancer-related

Genes of type A 5100 4900

Genes of type B 4900 5900

P=0.00706

P<2.2 E-16

Page 209: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Bonus take-home message

• Significance is due to– Actual signal (like 60% vs 40%) strength– How much data you have

• So, you can get significance by– Having big signal on small data– Having weak signal but lots of data– Or some combination

• This is an issue with genome-wide data, where you often have TONS of data

Page 210: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Going from typical vectors to contingency tables

• Say that we have typical data files with categorical data (“female”, “male” )

Gender Car_ownerM NF YM YF YM NM YM Y• We often want to reshape this into a contingency table: how many

males have a car, how many do not, etc

Owns car(Y) Does not own car (N)

Female(F) 2 0

Male(M) 3 2

Page 211: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The table() function

…makes this very easy

> gender<-c("M", "F", "M", "F", "M“, “M”, “M”)> car_owner<-c("N", "Y", "Y", "Y", "N“, “Y”, “Y”)>table(gender, car_owner)# important: gender and car should not be continuous values

car_ownergender N Y F 0 2 M 2 3

Page 212: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The chi-square test

• chisq.test()• Similar to Fisher Exact test, but less

computationally intensive: we can use if for any size of a matrix.

• Sensitive to small counts, especially zeroes - will warn about this

• Almost indentical in usage to fisher.test()

Page 213: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Let’s add some dimensions…

Is the blood group distribution significantly different between Europeans and Africans?

Test this with both fisher.test and chisq.test

Blood group Africa Europe

A 58 90B 34 16AB 9 80 100 86

Page 214: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

# horrible mistake ahead…can you spot it?> a<-matrix(c(58, 34, 9, 100 , 90, 16, 8, 86), nrow=2 )

> chisq.test(a)

Pearson's Chi-squared test

data: a X-squared = 194.352, df = 3, p-value < 2.2e-16

Page 215: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> a<-matrix(c(58, 34, 9, 100 , 90, 16, 8, 86), ncol=2 )> a [,1] [,2][1,] 58 90[2,] 34 16[3,] 9 8[4,] 100 86> chisq.test(a)

Pearson's Chi-squared testX-squared = 14.5091, df = 3, p-value = 0.002288

data: a

Page 216: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> fisher.test(a)

Fisher's Exact Test for Count Data

data: a p-value = 0.002048alternative hypothesis: two.sided>

Page 217: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

When to use Fisher/Chi?

• Fisher exact test is “exact” - it tests all possible permutations, while chi-squre is an approximation

• Chi-square has problems with 0 count bins• Chi-square is less computationally

demanding - fisher will run out your memory if the set is too large

Page 218: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Some handy functions on matrices

Apply()Applies a function to all rows or columns

in a matrix: works like thisapply(X, MARGIN, FUN, ...)X is the matrixMargin is rows(1=rows, 2=columns,

c(1,2)=both)FUN is the function, like “mean”

Page 219: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Some examples> a [,1] [,2][1,] 58 90[2,] 34 16[3,] 9 8[4,] 100 86

> apply(a, 2, sum)[1] 201 200

> apply(a, 1, sum)[1] 148 50 17 186

Note that we lost the order between columns here!

Column-wise

Row-wise

> apply(a, 2, sort) [,1] [,2][1,] 9 8[2,] 34 16[3,] 58 86[4,] 100 90

apply(a, 1, sort) [,1] [,2] [,3] [,4][1,] 58 16 8 86[2,] 90 34 9 100

Page 220: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Writing custom functions in R

• As you have found out, sometimes we have to write a lot in R to get things done

• We often save the things you write in a separate file and do copy paste

• Sometimes this gets awkward – especially if you are repeating something very many times with small changes

Page 221: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Example 1

• Say that we have a 30 experiments, where each is a vector in a dataframe

• We want to have the mean, max and min value of each of those, and plot the distribution of the values

• That’s a LOT of typing• Instead, we’ll write a function that does

this for us

Page 222: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

• Defining your own function in R– Lets make a function that calculates

X2 +x+1 for any given x

my_function<- function(x) {

x^2 + x + 1

}

Function nameTelling R that we define a function

Argument to the function

The “code” of the function – anything within the { }

Page 223: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

#define it

my_function<- function(x) {

x^2 + x + 1

}

# run it

my_function(10)

[1] 111

my_function(5)

[1] 31

Page 224: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

challenges

1) Remake the function so that it calculates x3+x2+x

2) Make a new function that calculates the sum 1+2+3…etc up to x

3) Remake this function to calculate the sum from y to x (where y also is an input)

Page 225: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> my_function<- function(x) { x^3 +x^2 + x + 1 }> # run it> my_function(10)[1] 1111

> my_function<- function(x) { sum(1:x) }> # run it> my_function(10)[1] 55

> my_function<- function(x,y) { sum(y:x) }> # run it> my_function(5,10)[1] 55

Page 226: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

> my_function<- function(y, x) { sum(y:x) }>

> my_function(10)Error in my_function(10) : element 1 is empty; the part of the args list of ':' being evaluated was: (y, x)> my_function(5,10)[1] 45> my_function(9,10)[1] 19

Page 227: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Arguments• Sometimes we want default inputs• We can set defaults to arguments, like this

> my_function<- function(y=5,x=10){

sum(y:x)

}

> my_function()

[1] 45

> my_function(5,10)

[1] 45

> my_function(5,11)

[1] 56

Page 228: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Iterators/loops

Often, we want to “walk over” a set of numbers, a set of vectors, and repeat an analysis

Like applying the mean_and_max function to 30 experiments,one by one

Or, in the flowers.txt file, to plot all columns, each in a separate picture

Page 229: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

The for() loop

Best described by example – this will print all values in the vector 1:5

for(i in 1:5){

print(i)

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

Page 230: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Challenge

• Rewrite this program to write out the square root of each value in the vector x

my_function<-function(x){

for(i in x){

print(i)

}

}

Page 231: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

my_function<-function(x){

for(i in x){

print (sqrt(i))

}

}

Page 232: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Return valuesMost functions in R return a value (or

something more complex) instead of writhing on the screen

There are two ways of doing this

1)Per default, R will return the last statement evaluated

2)We can instead force returning something by saying return (whatever). If R returns something it will also end the function (!)

Page 233: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Examples

my_function<- function(x){

x^2 + x + 1

}

# run it

my_function(10)

[1] 111

my_function(5)

[1] 31

my_function<- function(x) {

my_val <- x^2 + x + 1

return (my_val)

}

# run it

my_function(10)

[1] 111

my_function(5)

[1] 31

Page 234: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Hardest challenge so far...

• Rewrite this program to RETURN – not print - a vector which is the square root values of x (where x also is a vector)

my_function<-function(x){

for(i in x){

print (sqrt(i))

}

}

#look at next slide before continuing

Page 235: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Why is this the wrong solution:

my_function<-function(x){

for(i in x){

return(sqrt(i))

}

}

my_function(c(2,3,4))

[1] 1.414214 # we should get three values

Page 236: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

my_function<-function(x){

for(i in x){

return(sqrt(i))

}

}

We need to somehow build a new vector, and fill that in with the square roots as the for loop progresses

Hint: we can append some value to a vector by saying

my_vector<-c(my_vector, somevalue)

Page 237: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

my_function<-function(x){ my_vector<-c(); # empty vector

for(i in x){my_value<-(sqrt(i))# append this to the vectormy_vector<-c(my_vector, my_value)# and repeat

}return(my_vector)

} my_function(c(2,3,4))[1] 1.414214 1.732051 2.000000

Page 238: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Extensions to R• R has many “add-on” packages for specific

analyses

• Some of these you already have, but they are not accessible unless you “ask” for them

• Others you need to download

• Bioconductor, for array analysis and more, is such an addon. We will use this later in the course

Page 239: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Testing out Lattice graphics

• Different type of plotting methods: basically a “conditional plot”

• This is a “rewrite of R’s standard “plot()” command – called “xyplot()”

Page 240: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Testing…

my_data<-read.table("flowers.txt", h=T)names(my_data)attach(my_data)# complex command ahead!xyplot(Sepal.Length~Sepal.Width|Species )

# wont work! Why?

Page 241: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Need to “activate” the library first

• library(lattice)

xyplot(Sepal.Length~Sepal.Width|Species )

help(package=“lattice”) will give an overview

Page 242: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

xyplot(Sepal.Length~Sepal.Width|Species )

Interpreted asPlot length as a function of width, broken up by species

Page 243: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Free-form exercise/homework

Using reference sheet and exploration of the drop-down menus in R, figure out how to get, install and run the barplot2() command in the gplots package (it is NOT in your computer)

Page 244: R basics Albin Sandelin. Ultra-short R introduction Most life scientists use spreadsheet programs for analysis - like excel or statistica. Why? Ease of

Now what?• With the commands and concepts we have

looked at so far, you can do quite a lot• But, we have only scratched the R surface• Best to view the content so far as a “driving

license”: now you are capable of exploring by yourself

• We will now go over to use R in a genomics context