A gentle introduction to R – how to load in data and produce summary statistics BRC MH Bioinformatics group

Tutorial outline

• How to install R on your own computers
  – It's free
  – But it's already installed on these computers
• Loading data from Excel
• Plotting
• Summary statistics

Files

• Data and slides on: http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

Show file extensions

• Uncheck ‘hide extensions for known file types’

• Click ‘Apply’

Installing R – skip as already installed

Then follow the operating-system-specific installation instructions

Starting R on these computers

Help files

Loading help files

• A useful function is read.table()
  – It allows you to read data from spreadsheets into R
• To see its help file you can use
?read.table
• You can use ?function_name for any function to see a help file

Loading data into R from Excel

From Excel

• Open testdata.xls
• You need to save it as a comma-separated value file (.csv): go to File > Save As > Other Formats

R working directory

• To open a file you will need to point R towards the folder that contains it.

• You can do this with setwd(), but we’ll do it using the mouse

• Suppose you have the file in My Documents
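For reference, the same step can be done from the console with setwd(). This is only a minimal sketch: data.dir is a placeholder name, and it is set to R's temporary folder here purely so the example runs on any machine. In practice it would be the path to the folder holding your files.

```r
# 'data.dir' is a placeholder: replace it with the folder holding your files.
data.dir <- tempdir()

# setwd() changes the working directory and returns the previous one,
# which we keep so that we can switch back afterwards.
old.dir <- setwd(data.dir)
print(getwd())

# Return to where we started.
setwd(old.dir)
```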

Browsing folders

• To check that you are in the right folder type
getwd()
• To see files in this folder you can type
list.files()
• To list the current variables type
ls()
• Nothing should be loaded yet

Loading data

To follow along with this section, make sure your R working directory is the folder that contains the tutorial data

• Read the contents of file testdata.csv into an R variable my.data with:
my.data <- read.csv('testdata.csv')
• read.csv is a wrapper for read.table; read.table lets you specify more details about your file, e.g.:
my.data <- read.table('testdata.csv', sep=',', header=TRUE)

read.table()

• sep : Column separator
• header : Does the first row of the file contain column headers?
• skip : Number of rows to skip at the top of the file
• ?read.table for other useful parameters
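To see sep and header in action without the tutorial files, the sketch below writes a tiny file to a temporary location and reads it back. The file contents and the name toy.data are made up for illustration; only the column names follow the tutorial data.

```r
# Write a tiny made-up file to a temporary location (values are invented).
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "170,65,female",
             "182,80,male"), tmp.file)

# sep: columns are separated by commas.
# header: the first row holds the column names.
toy.data <- read.table(tmp.file, sep = ",", header = TRUE)
print(toy.data)
```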

Looking at loaded data

• Check your new variable is in the R environment:
ls()
• Take a look at the first few lines:
head(my.data)
• Generate some basic summary stats:
summary(my.data)

• Check the dimensions of your dataset:
dim(my.data)
• Number of rows and columns
nrow(my.data)
ncol(my.data)
• Row and column names
rownames(my.data)
colnames(my.data)

Subsetting Data

• Look at the first row:
my.data[1,]
• Look at the first column:
my.data[,1]
• Look at the third column of row 10
my.data[10,3]

• Look at the first column for rows 100 to 110
my.data[100:110,1]
• Same as above, but save to a variable
my.subset <- my.data[100:110,1]
• Look at rows 30, 40, 50 and 60
my.data[c(30,40,50,60),]
• Same as above but pre-defining the index vector
my.indices <- c(30, 40, 50, 60)
my.data[my.indices,]

You can subset on names instead of indices:

• Look at the column named 'weight' for row 1:
my.data[1,'weight']
• Look at the columns named 'weight' and 'height' for row 1:
my.data[1,c('weight','height')]
• Same as above but pre-define the colnames vector
cols <- c('weight','height')
my.data[1,cols]

Negative indices exclude elements:

• Look at all columns except the second for row 1
my.data[1,-2]
• Extract all rows except 1-100
my.new.data <- my.data[-1:-100,]
• Extract all rows except 35, 67, 101
my.indices <- -1 * c(35, 67, 101)
my.new.data <- my.data[my.indices,]
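The subsetting rules above can be tried on a data frame small enough to check by eye. The sketch below assumes nothing about testdata.csv beyond its column names; toy.data and its values are invented for illustration.

```r
# A small made-up data frame standing in for my.data.
toy.data <- data.frame(height = c(170, 182, 165, 175, 190),
                       weight = c(65, 80, 55, 72, 95),
                       gender = c("female", "male", "female", "male", "male"))

# Positive indices keep rows: rows 2 and 4 only.
kept <- toy.data[c(2, 4), ]

# Negative indices drop rows: everything except rows 2 and 4.
dropped <- toy.data[-c(2, 4), ]

# Names work for columns: weight and height of row 1.
first.row <- toy.data[1, c("weight", "height")]

print(kept)
print(dropped)
print(first.row)
```

Note that positive and negative indices cannot be mixed in the same index vector.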

Quiz!

• How tall is the person in the 7th row?

• What gender is the person in the 300th row?

• For the people in rows 20-30, who is the heaviest?

• For the people in rows 110, 350, 219, 74, who is the tallest?

• Save all rows except 500-600 in a variable my.new.data

• How many males and females are in this new dataset?

Formatting problems

Data isn't comma-separated?

• Specify the separator in read.table

• Tab-delimited text is another common format, for which you can use sep="\t"

Load "testdata.txt", a tab-delimited version of the data
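As a rough illustration of the tab separator, the sketch below builds a small tab-delimited file in a temporary location and reads it in two ways. The file contents are invented; read.delim() is a base R wrapper whose defaults are sep="\t" and header=TRUE.

```r
# Build a tiny made-up tab-delimited file ("\t" is the tab character).
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("height\tweight\tgender",
             "170\t65\tfemale",
             "182\t80\tmale"), tmp.file)

# Spell out the separator with read.table ...
toy.data <- read.table(tmp.file, sep = "\t", header = TRUE)

# ... or use read.delim, which assumes tabs and a header row by default.
same.data <- read.delim(tmp.file)

print(toy.data)
```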

Data has extra header information at the top?

• Either delete this data in Excel before exporting to csv

• Or, use the skip=N argument to read.table

Have a look at "testdata_1.csv" in Excel and then load it into R using read.table
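The effect of skip can be seen on a made-up file with two lines of preamble above the real header. The preamble text and values below are invented and are not the contents of testdata_1.csv.

```r
# A made-up file: two preamble lines, then the header row, then the data.
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("Example export",
             "Some notes about the file",
             "height,weight,gender",
             "170,65,female",
             "182,80,male"), tmp.file)

# skip = 2 discards the two preamble lines before the header is read.
toy.data <- read.table(tmp.file, sep = ",", header = TRUE, skip = 2)
print(toy.data)
```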

Factors are inconsistently named

• R will just read in the data you give it.

• If you aren't consistent in naming the levels of your factors, R will see them as different levels

• R is case sensitive. 'MyLevel' != 'mylevel'

Load the data from testdata_2.csv and have a look at the gender variable.

Try and fix the problems in Excel and reload.
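As an alternative to fixing the labels in Excel, the same clean-up can be sketched in R. The labels below are invented examples of inconsistent entry and are not taken from testdata_2.csv; tolower() and trimws() are base R functions.

```r
# Made-up labels showing typical inconsistencies: case and stray spaces.
gender <- c("Male", "male", "FEMALE", "female", "Female ")

# R treats every distinct spelling as its own level.
print(table(gender))

# Lower-case everything and trim surrounding whitespace, then make a factor.
clean.gender <- factor(tolower(trimws(gender)))
print(levels(clean.gender))
print(table(clean.gender))
```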

Measurements and units in a single column

• If you store values like 10kg, R will not interpret this as a numeric column

Try loading file 'testdata_3.csv' - what has happened to the weights and heights information?

Try loading again so that the two are loaded as character vectors.

Have a look at the sub() function and see if you can fix the problem
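One possible way to use sub() is sketched below on invented values; it assumes every entry ends in the same unit, "kg". If units vary within a column, more work is needed.

```r
# Made-up weights stored as text with the unit attached.
weight.raw <- c("65kg", "80kg", "55kg")

# sub() replaces the first match of "kg" with nothing;
# fixed = TRUE treats "kg" as plain text rather than a regular expression.
weight.text <- sub("kg", "", weight.raw, fixed = TRUE)

# The remaining text can now be converted to numbers.
weight.num <- as.numeric(weight.text)
print(weight.num)
print(mean(weight.num))
```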

Excel has just screwed up your data

• Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version.

Avoid opening large datasets in Excel; use R instead

• Excel tries to be helpful by formatting elements for you. Try the following and then open in Excel, save as csv and reload into R. What has happened?

my.genes <- c('MASH1','SOX2','OCT4')
write.csv(my.genes, file='mygenes.csv')

Plotting

Drawing histograms

Optional exercises:

1) Try drawing a histogram of height

2) Try and label the x axis [hint: read the help file]
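A minimal sketch of a labelled histogram follows. It uses simulated heights rather than my.data, and the unit in the axis label (cm) is an assumption, since the slides do not state the units.

```r
# Simulated heights standing in for my.data$height (values are random).
set.seed(1)
toy.height <- rnorm(200, mean = 170, sd = 10)

# xlab labels the x axis and main sets the title;
# hist() also returns the bin counts invisibly.
h <- hist(toy.height,
          xlab = "Height (cm, assumed unit)",
          main = "Histogram of height")
print(h$counts)
```

With the tutorial data loaded, the equivalent call would be hist(my.data$height, xlab="Height").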

Drawing normal QQ plots

qqnorm(my.data$weight)
qqline(my.data$weight)

Drawing scatterplots

Optional exercises: try these. Do you understand this plot?

plot(height~weight,data=my.data)

# factor() is needed where gender was read in as text (the default since R 4.0)
plot(height~weight,data=my.data,col=as.numeric(factor(gender)))

Drawing boxplots

boxplot(height~gender,data=my.data)

Saving plots

JPEGs

jpeg("boxplot.jpg")
boxplot(height~gender,data=my.data)
dev.off()

PDFs

pdf("boxplot.pdf")
boxplot(height~gender,data=my.data)
dev.off()

Summary statistics

Functions Covered

http://www.statmethods.net/index.html

Writing tables
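The content of this slide did not survive extraction, so the following is only a sketch of one common way to write a table from R, using write.csv() on invented data. The file is written to a temporary location; the name out.file is a placeholder.

```r
# A small made-up data frame to save.
toy.data <- data.frame(height = c(170, 182, 165),
                       weight = c(65, 80, 55),
                       gender = c("female", "male", "female"))

# row.names = FALSE stops R adding an extra column of row numbers.
out.file <- tempfile(fileext = ".csv")
write.csv(toy.data, file = out.file, row.names = FALSE)

# Read the file back in to confirm what was written.
reloaded <- read.csv(out.file)
print(reloaded)
```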

Calculate Mean and SD
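The original slide content is missing here, so this is a sketch on invented data rather than the tutorial's own code. It uses the base R functions mean(), sd() and tapply().

```r
# A small made-up data frame standing in for my.data.
toy.data <- data.frame(height = c(170, 182, 165, 175, 190),
                       gender = c("female", "male", "female", "male", "male"))

# Mean and standard deviation of one column.
# (Add na.rm = TRUE if the column contains missing values.)
height.mean <- mean(toy.data$height)
height.sd <- sd(toy.data$height)
print(height.mean)
print(height.sd)

# Mean height within each gender group.
mean.by.gender <- tapply(toy.data$height, toy.data$gender, mean)
print(mean.by.gender)
```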

Correlate phenotypes and test for group differences
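The slide's own code is not available, so the sketch below shows one plausible reading: cor.test() for the correlation between two measurements and t.test() for a difference between two groups. The data are simulated, and the numbers used to generate them are arbitrary.

```r
# Simulated data: 50 females and 50 males with arbitrary means.
set.seed(42)
n <- 100
gender <- rep(c("female", "male"), each = n / 2)
height <- rnorm(n, mean = ifelse(gender == "male", 178, 165), sd = 7)
weight <- 0.9 * height - 85 + rnorm(n, sd = 8)
toy.data <- data.frame(height, weight, gender)

# Pearson correlation between height and weight, with a test of r = 0.
cor.result <- cor.test(toy.data$height, toy.data$weight)
print(cor.result$estimate)

# Welch two-sample t-test: does mean height differ between the groups?
t.result <- t.test(height ~ gender, data = toy.data)
print(t.result$p.value)
```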

It is always important to check model assumptions before making statistical inferences

Linear regression
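As the slide content is missing, the following is a minimal sketch of fitting a linear regression with lm() on simulated data, followed by a quick residual check of the kind used earlier for QQ plots. The generating values (intercept 120, slope 0.7) are arbitrary.

```r
# Simulated data with a known linear relationship plus noise.
set.seed(42)
weight <- rnorm(100, mean = 70, sd = 10)
height <- 120 + 0.7 * weight + rnorm(100, sd = 5)
toy.data <- data.frame(height, weight)

# Fit height as a linear function of weight.
fit <- lm(height ~ weight, data = toy.data)
print(coef(fit))      # estimated intercept and slope
print(summary(fit))   # standard errors, p-values, R-squared

# Check the normality assumption on the residuals with a QQ plot.
qqnorm(resid(fit))
qqline(resid(fit))
```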