A Crash Course in R (and other notes) - University of Ottawa · A Crash Course in R (and other...

A Crash Course in R (and other notes)

Professor Russell Steele (1), edited, with permission, by Dr. Pierre-Jerome Bergeron (2)

(1) Department of Mathematics and Statistics, McGill University

(2) Department of Statistics and Actuarial Sciences, University of Waterloo


Intro to statistical computing ideas

Statistical computation is a tool, not a proof

Computational results can be used to develop intuition, but not to confirm

Computational results are also valuable as a form of data analysis

Think before you program

More time is wasted on debugging badly written programs than on anything else

Will this program be used again on another dataset or as part of another program?

Will other people be using this code?

Can I compartmentalize the programming so that it’s easier to improve speed and/or efficiency without completely re-writing code?

Has anyone else created the tools that I need to accomplish my task?

Programming bit by bit

Always have a small test dataset for which you know the correct answer

Always test the program as you write it

Try to keep things simple and specific at first, then generalize

Document, document, document

Write usable, reusable code

Find the balance between hardwiring inputs and using arguments

Make sure you’re using the right platform for your problem

Write code that is abstract but still not annoyingly vague

Open discussion points

Need to familiarize yourself with the computing environment

You can get R for free from http://cran.r-project.org/ (Windows and Linux versions); upgrading to the newest version is a good idea

If you wish to incorporate C code, you may also need GNU compilers (easy on Mac and Linux, more in the coming weeks for Windows).

Text editor and other applications

VI and Emacs are two common editors for Mac OS X and Linux/UNIX systems, although any text editor will do

XEmacs is also available for Windows

On Mac OS X, Xcode allows for editing and compiling, similar to Bloodshed Dev-C++

Eclipse, KDevelop and Anjuta are available for Linux

First ‘mission’

If you haven’t done so, download the R statistical package and install it on your own machine

Check to see if you have the necessary development software installed on your machine

Basic R stuff

Today: Quick review of basic R stuff (large bore potential)

Vectors

Matrices

Functions

Useful commands

Frames, arrays, lists

The basic R files

Anytime you start R, the application looks for two files, .Rhistory and .RData

Each .RData workspace is different

Advantage: You can have several, separate workspaces for functions, datasets, etc. that can be transferred from machine to machine

Disadvantage: You can have several, separate...

The basic R files

For Unix/Linux platforms, if you are starting R in a directory where these files do not exist, it will create new ones for you

For Windows users, the workspaces can be saved in weird places (cause Windows is weird); use File -> Load Workspace to find where it is stored on your particular machine

Save workspaces on Windows using File -> Save Workspace ; on Unix/Linux rename the .RData file after saving and exiting R in order to keep from opening it again the next time.

.RData files can be used on different platforms interchangeably

Other fundamental R stuff

.Rhistory file keeps all commands for a given workspace; the history() command lets you access them.

ls() will tell you the contents of the currently loaded workspace

library( libraryname ) will allow you to access the functions and data in the libraryname library

http://cran.r-project.org/ has a list of libraries and packages that one can download that are not part of the R base package
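As a rough illustration of the workspace commands above, the sketch below saves two objects to a file, removes them, and loads them back. The object names and the temporary file are made up for this example; save.image() would write the whole workspace in the same way.

```r
# Hypothetical objects, created only for this illustration
demo.vec <- c(1, 2, 3)
demo.fun <- function(x) x + 1

# Save just these two objects to a temporary .RData-style file
ws.file <- tempfile(fileext = ".RData")
save(demo.vec, demo.fun, file = ws.file)

# Remove them from the workspace, then restore them from the file
rm(demo.vec, demo.fun)
load(ws.file)

ls()                 # lists the objects in the current workspace
demo.fun(demo.vec)   # 2 3 4
```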

Vectors

Building blocks of R

Create a vector with the function c( ... )

Example:

> firstvec<-c(89,10,390.38)

Can access parts of a vector using subscripts

Vectors (cont)

By using vectorname[i], where i is the ith position in the vector, we can either print:

> firstvec[2]
[1] 10

or change the contents of the vector:

> firstvec[3]<-271.401
> firstvec
[1] 89.000 10.000 271.401

Variables

When I talk about variables, just think about variables as vectors with one element

Example:

> firstvar<-5
> firstvar
[1] 5

Useful for indices and constants that you’ll need later on

Vector Operations

Using operations such as ’+’, ’-’, ’*’, ’/’ can be tricky depending on the lengths of the vectors you’re trying to use:

If the vectors are the same length, the operation is performed elementwise

If one vector contains one element, then the operation is performed for each element of the other vector

Otherwise, the shorter vector is recycled until it matches the longer one, with a warning if the longer length is not a multiple of the shorter

Vector Operations

> 5*firstvec
[1] 445.000 50.000 1357.005
> c(5)*firstvec
[1] 445.000 50.000 1357.005
> c(5,2)*firstvec
[1] 445.000 20.000 1357.005
Warning message:
longer object length
is not a multiple of shorter object length in: c(5, 2) * firstvec
> c(5,2,3)*firstvec
[1] 445.000 20.000 814.203
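The results above come from R recycling the shorter vector. A small sketch that spells out the recycled vectors explicitly (the variable names other than firstvec are illustrative):

```r
firstvec <- c(89, 10, 271.401)

# A length-one vector is recycled to the length of the longer one
scaled <- 5 * firstvec
same   <- rep(5, length(firstvec)) * firstvec

# A length-two vector is recycled as 5, 2, 5; R also issues a warning
# (suppressed here) because 3 is not a multiple of 2
recycled <- suppressWarnings(c(5, 2) * firstvec)
explicit <- c(5, 2, 5) * firstvec

identical(scaled, same)        # TRUE
identical(recycled, explicit)  # TRUE
```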

More vectors

c(..) can be used to concatenate vectors as well

> secondvec<-c(65, 109, 109.80)
> combinedvec<-c(firstvec, secondvec)
> combinedvec
[1] 89.000 10.000 271.401 65.000
[5] 109.000 109.800

Matrix Operations

A matrix can be thought of as a collection of vectors

Using the matrix function, you can create a matrix from a single vector, filling the matrix from top to bottom, left to right:

> firstmatrix<-matrix(combinedvec,nrow=3, ncol=2)
> firstmatrix
        [,1]  [,2]
[1,]  89.000  65.0
[2,]  10.000 109.0
[3,] 271.401 109.8
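The fill order can be changed with the byrow argument of matrix (it appears in the args(matrix) listing). A brief sketch with the same six values; the names colfilled and rowfilled are made up here:

```r
combinedvec <- c(89, 10, 271.401, 65, 109, 109.8)

# Default: the vector fills the matrix column by column
colfilled <- matrix(combinedvec, nrow = 3, ncol = 2)

# byrow = TRUE: the vector fills the matrix row by row instead
rowfilled <- matrix(combinedvec, nrow = 3, ncol = 2, byrow = TRUE)

colfilled[1, ]   # 89 65
rowfilled[1, ]   # 89 10
```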

More matrices

Accessing/changing elements of a matrix requires specifying a row index and a column index

> firstmatrix[3,2]
[1] 109.8
> firstmatrix[1,1]<-79
> firstmatrix
        [,1]  [,2]
[1,]  79.000  65.0
[2,]  10.000 109.0
[3,] 271.401 109.8

Matrices

Accessing whole columns/rows can be done by leaving the other index blank

> firstmatrix[1,]
[1] 79 65
> firstmatrix[,2]
[1] 65.0 109.0 109.8

Matrix operations

R defaults to elementwise operation for multiplication and division of matrices too

Matrix operations

> matrixa<-matrix(c(1,2,3,4), ncol=2)

> matrixa

[,1] [,2]

[1,] 1 3

[2,] 2 4

> matrixb<-matrix(c(5,6,7,8), ncol=2)

> matrixb

[,1] [,2]

[1,] 5 7

[2,] 6 8

> matrixa*matrixb

[,1] [,2]

[1,] 5 21

[2,] 12 32

Matrix operations

You can use special operators (such as %*% ) to do the normal matrix multiplication:

> matrixa%*%matrixb
     [,1] [,2]
[1,]   23   31
[2,]   34   46

Matrix operations

To find the inverse of a square matrix, use the function solve .

> ainv<-solve(matrixa)
> ainv
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
> ainv%*%matrixa
     [,1] [,2]
[1,]    1    0
[2,]    0    1
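solve also accepts a right-hand side, which solves a linear system without forming the inverse explicitly. A short sketch, where the vector rhs is a made-up example:

```r
matrixa <- matrix(c(1, 2, 3, 4), ncol = 2)
rhs <- c(5, 6)   # hypothetical right-hand side

# Solve matrixa %*% x = rhs directly
x.direct <- solve(matrixa, rhs)

# Same answer via the explicit inverse
x.inverse <- solve(matrixa) %*% rhs

x.direct                                    # -1  2
all.equal(x.direct, as.vector(x.inverse))   # TRUE
```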

Functions

Functions can be thought of as shortcuts to perform sequences of calculations

All functions take arguments and most useful functions return objects

Functions can call other functions

Calling functions

We’ve already discussed a couple of functions, c(...) and matrix(...)

c(...) takes any number of arguments and returns a vector containing those values

matrix(...) takes a couple of different kinds of arguments: the values for the matrix, the number of rows, and the number of columns

Other examples of built-in R functions are mean(...), sum(...), var(...), lm(...)

Useful functions for, um, functions

args( function ) will return the possible arguments for a function

> args(matrix)
function(data = NA, nrow = 1, ncol = 1,
         byrow = FALSE, dimnames = NULL)

Useful functions (cont.)

help( function ) will return a description of what the function does, the arguments it takes, and the values it returns

matrix package:base

Matrices

Description:

‘matrix’ creates a matrix from

the given set of values. ‘as.matrix’

attempts to turn its argument into a

matrix. ‘is.matrix’ tests if its

argument is a (strict) matrix.

Usage:

Writing functions

The easiest way to write functions in R is to create the function in an outside text editor, such as emacs or vi on Linux/Unix machines and Notepad under Windows, OR choose to create a script using the menu in R

A function declaration has a specific format

One must specify the name of the function, the arguments, and the body of the function

In R, the last calculation performed in the function determines what the returned value is

Declaring functions

Either at the command line or in a separate file, write:

functionname <- function( argument1, argument2, ...){
    function calculation 1
    function calculation 2
    ....
    returned function calculation
}

First function

The first function we can write is a standard deviation function, stdev()

stdev() should take a vector as an argument and should return the square root of the variance of the vector of observations

Here’s what we would write in our text file stdev.R :

stdev(...) definition

stdev<-function( datavector ){
    sqrt( var( datavector ) )
}

Using stdev() in R

Can use the source function to load your function into R if it was not created in R

> source("[Source directory]/stdev.R")
> stdev
function( datavector ){
    sqrt( var( datavector ) )
}

Using stdev() in R

Then test out the function using a small dataset for which you know the answer

> testvector<-c(7,3,5)
> var(testvector)
[1] 4
> sd(testvector)
[1] 2
> stdev(testvector)
[1] 2
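The same check can be kept in a script so that it runs every time the file is sourced. A minimal sketch using stopifnot and all.equal, which stop with an error if a check fails (the second test vector is an arbitrary choice):

```r
stdev <- function(datavector) {
  sqrt(var(datavector))
}

testvector <- c(7, 3, 5)

# Known answer: the variance is 4, so the standard deviation is 2
stopifnot(isTRUE(all.equal(stdev(testvector), 2)))

# Agreement with the built-in sd() on another small vector
stopifnot(isTRUE(all.equal(stdev(c(1, 2, 3, 4)), sd(c(1, 2, 3, 4)))))
```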

Useful function tips

Typing the function name with no ()’s (like with stdev ) will print the function declaration to the screen (works even with built-in R functions)

Remember to use source each time you make changes to the text file that contains your function(s)

You can have multiple arguments for any given function

You can also set default values for arguments that take certain values a majority of the time
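A small sketch of a function with two arguments, one of which has a default value. The function shift.data and its argument names are hypothetical, chosen only to illustrate the syntax:

```r
# Subtract a centre from every observation; by default the centre
# is the mean of the data itself
shift.data <- function(datavector, center.value = mean(datavector)) {
  datavector - center.value
}

shift.data(c(1, 2, 3))                     # -1  0  1 (default used)
shift.data(c(1, 2, 3), center.value = 1)   #  0  1  2 (default overridden)
```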

Another example function

Let’s say that we want a function that calculates a trimmed mean

Remember that a trimmed mean takes an ordered vector of data and eliminates the first and last X% of points before taking the mean

What arguments should the function take?

What built-in functions will we want to use?

Trimmed mean function

Arguments: Data vector (maybe take a matrix?), percentage (fixed or an argument)?

Built-in functions we’ll need: mean, sum, sort, length?

Step-by-step

Initial function declaration in trimmed.mean.R , just to make sure the syntax is correct:

trimmed.mean<-function( datavector, trim.percent=5 ){
    mean(datavector)
}

Step-by-step

## Generate 100 Exponential random variables

> trim.test.vector<-rexp(100,1)

# Look at the values: yours will be different

> trim.test.vector

[1] 1.68888770 0.16894626 0.37284041 0.27736393

> mean(trim.test.vector)

[1] 0.9427151

> source("trimmed.mean.R")

> trimmed.mean(trim.test.vector)

[1] 0.9427151

Step-by-step

Now we need to set up the trimming...

We want to eliminate the first X% and last X% of the sorted data before taking the mean

We’ll need a couple of R tricks to do this

length(...) will give us the length of a vector

Can we assume that the data vector is in the correct order?

What are the indices of the data points that we want to include in the mean calculation?

Step-by-step

First, we can use the sort(...) function to sort the data

Then, we want to drop the same number of points from each end of the data set, trim.percent/100 * length(datavector)

So calculate dropnumber and then find the mean of the data vector without those 2*dropnumber observations

Step-by-step

trimmed.mean<-function( datavector, trim.percent=5 ){
    dropnumber<-round(trim.percent/100*length(datavector))
    mean(sort(datavector)[(dropnumber+1):(length(datavector)-dropnumber)])
}

Taking it apart

Round to make sure it’s an integer

dropnumber<-round(trim.percent/100*length(datavector))

sort sorts the incoming datavector and returns the sorted vector

’a:b’ is a shortcut used to get a sequence of integers between a and b

mean(sort(datavector)[(dropnumber+1):(length(datavector)-dropnumber)])

Think about what the indices are of the numbers that we want to get

Checking the results

> source("trimmed.mean.R")

> trimmed.mean(trim.test.vector)

[1] 0.813712

> trimmed.mean(trim.test.vector, trim.percent=10)

[1] 0.7287425

Why does it work?

sort(...) returns a vector containing the argument vector’s sorted values

Because sort(...) returns a vector, we can use the brackets [ ] to access part of the vector, i.e. sort(...)[i:j]
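One way to check trimmed.mean on a dataset with a known answer is to compare it with the trim argument of the built-in mean. This is only a sketch: the built-in takes a fraction rather than a percentage and rounds the number of dropped points down, so the two can disagree when trim.percent/100 * length(datavector) is not a whole number. The test data below are made up.

```r
trimmed.mean <- function(datavector, trim.percent = 5) {
  dropnumber <- round(trim.percent / 100 * length(datavector))
  mean(sort(datavector)[(dropnumber + 1):(length(datavector) - dropnumber)])
}

# The integers 1 to 20 in scrambled order
testdata <- c(12, 3, 7, 1, 9, 15, 4, 8, 6, 10,
              2, 14, 5, 11, 13, 20, 18, 16, 19, 17)

# Dropping 25% from each end leaves 6, 7, ..., 15, whose mean is 10.5
ours    <- trimmed.mean(testdata, trim.percent = 25)
builtin <- mean(testdata, trim = 0.25)

ours                       # 10.5
all.equal(ours, builtin)   # TRUE
```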

while loops

while loops are used to indefinitely repeat sets of calculations until a specified goal has been reached

The loop first checks the condition; if true, it executes the statements in order and then checks the condition again; if false, it skips everything and goes to the next statement outside the loop

The syntax for writing a while loop in R is:

while( condition ) {
    looped line 1
    looped line 2
}

Example: Newton’s method for root-finding

The standard example for stopping conditions is Newton’s method for finding roots of equations in one variable

Newton’s method uses the derivative of the function to determine a linear direction towards a root (x s.t. f(x) = 0), i.e. each run heads towards a single root, so it cannot find all roots of an equation

Example: Newton’s method

Arguments: function to be solved, derivative function, tolerance, initial value, maximum iterations

Return: root

What should the condition be?

Let $x_i$ be the “guessed” root at the ith iteration

If $|f(x_i)|$ > tolerance and the number of iterations is less than the maximum number of iterations, then it should continue.

Otherwise it should stop

Example: Newton’s method

newtons.method <- function( f, derivfunc,
                            tol=1e-3, init=0, maxiter=100){
    x <- init
    iter <- 1
    while( (abs(f(x)) > tol) & (iter<=maxiter)){
        print(x)
        x <- -f(x)/derivfunc(x) + x
        iter <- iter+1
    }
    x
}

Example code

> sinfunc <- function(x){ sin(x)*cos(x) }

> sinfunc.deriv <- function(y){ cos(y)*cos(y) - sin(y)*sin(y)}

> newtons.method(sinfunc, sinfunc.deriv, tol=1e-4, init=1)

[1] 2.09252

[1] 1.233947

[1] 1.633093

[1] 1.570472

[1] 1.570796
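As a second test case with a known answer, the same iteration can be applied to f(x) = x^2 - 2, whose positive root is sqrt(2). The sketch below uses a self-contained variant named newton.sketch (a made-up name) that omits the print call and returns x after the loop; the test function and starting value are illustrative choices.

```r
newton.sketch <- function(f, derivfunc, tol = 1e-3, init = 0, maxiter = 100) {
  x <- init
  iter <- 1
  # Continue while |f(x)| is above the tolerance and iterations remain
  while ((abs(f(x)) > tol) && (iter <= maxiter)) {
    x <- x - f(x) / derivfunc(x)
    iter <- iter + 1
  }
  x
}

sqfunc <- function(x) x^2 - 2
sqfunc.deriv <- function(x) 2 * x

root <- newton.sketch(sqfunc, sqfunc.deriv, tol = 1e-8, init = 1)
abs(root - sqrt(2)) < 1e-6   # TRUE
```

Starting from init = 0 would fail here because the derivative is zero at that point, which is one reason the choice of initial value matters.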

Unofficial assignment

Try out some of these functions in R

Play around with some other functions like seq , rnorm

Think about how we could have written the trimmed mean function to take a matrix and sort each column

for loops

for loops are used in functions to repeat similar tasks a number of times

for loops in R are actually rather inefficient (short story: shopping for memory is like shopping for toilet paper - better to buy in bulk)

If necessary, then they are simple to implement

R for loops are actually rather intuitive

for loops

Structure/syntax of a for loop

for ( counter in vector ) {
    looped line 1
    looped line 2
}

The loop will perform the list of looped lines length( vector ) times

The counter is a variable that is often used in the code for calculations or display
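The "buy in bulk" remark can be illustrated by comparing a loop that grows its result one element at a time with one that allocates the whole vector first. This is a sketch with made-up variable names; it shows that the approaches give the same answer and makes no claim about the size of the speed difference.

```r
n <- 1000

# Growing the result one element at a time forces repeated reallocation
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# Allocating the full vector first, then filling it in
prealloc <- rep(0, n)
for (i in 1:n) {
  prealloc[i] <- i^2
}

# The vectorised form needs no explicit loop at all
vectorised <- (1:n)^2

identical(grown, prealloc)        # TRUE
identical(prealloc, vectorised)   # TRUE
```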

Example: Convolving two finite sequences

Convolutions pop up in certain statistical (and mathematical) computations

Imagine that we have two sequences, $a_0, a_1, \ldots, a_{m-1}$ and $b_0, b_1, \ldots, b_{n-1}$, and we want to find their convolution $ab_0, ab_1, \ldots, ab_{n+m-2}$, where

$$ab_k = \sum_{i=0}^{k} a_i \, b_{k-i}$$

for valid indices $i$ and $k-i$

convolve.rfun function

What are the necessary arguments?

What should be returned from the function?

Arguments: a data vector, b data vector

Should return a vector of length length(a) + length(b) - 1

How do we calculate the convolution?

convolve.rfun<-function(a,b){
    ab<-rep(0,length(a)+length(b)-1)
    for(i in 1:length(a)){
        for(j in 1:length(b)){
            ab[i+j-1] <- ab[i+j-1] + a[i]*b[j]
        }
    }
    ab
}

> source("convolve.R")

> args(convolve.rfun)

function (a, b)

> testavec<-c(1:10)

> testbvec<-c(1:10)

> convolve.rfun(testavec,testbvec)

[1] 1 4 10 20 35 56 84 120 165 220 264 296 315

[14] 320 310 284 241 180 100

The big problem

Our new convolution function works pretty well for small sequences

The problem is that R is absolutely horrid with loops (as are most interpreted languages)

In fact, if we look at even a moderate sequence, we see that R can take a long time

Loops and speed issues in R

> testavec<-c(1:700)

> testbvec<-c(1:700)

> system.time(resultvec<-convolve.rfun(testavec,testbvec))

[1] 20.02 0.00 20.15 0.00 0.00

Loops and speed issues in R

We see from the system.time function that our convolution function takes about 20 seconds to run

This was not a very long pair of vectors to convolve, only 700 components each in length, and the function itself is trivial to write

We have two choices: we can either try to come up with sneaky R shortcuts to speed up the function (difficult to do in this case) or hand a lot of the heavy looping off to a C function
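One possible R-level shortcut, offered only as a sketch, is to keep the outer loop and replace the inner loop with a single vectorised update. The function name convolve.vec is made up here; the cross-check uses the built-in convolve from the stats package, whose documentation gives convolve(x, rev(y), type = "open") as the usual convolution.

```r
convolve.vec <- function(a, b) {
  ab <- rep(0, length(a) + length(b) - 1)
  idx <- seq_along(b) - 1   # offsets 0, 1, ..., length(b) - 1
  for (i in seq_along(a)) {
    # add a[i] * b to positions i, ..., i + length(b) - 1 in one step
    ab[i + idx] <- ab[i + idx] + a[i] * b
  }
  ab
}

testavec <- 1:10
testbvec <- 1:10
result <- convolve.vec(testavec, testbvec)
result[1:5]   # 1 4 10 20 35

# Cross-check against the FFT-based built-in
all.equal(result, convolve(testavec, rev(testbvec), type = "open"))   # TRUE
```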

if/else statements

if statements allow you to control when certain statements will be executed in your function

if statements follow the following form:

if( condition ) {
    calculation 1
    calculation 2
    ...
}

The function will only perform the calculations in the {}’s if the condition is true

if/else statements

If you add an else statement after the closing } of your if block, then if the condition is false, the function will execute what is in the else braces

if( condition ) {
    if calculation
} else {
    else calculation
}
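A concrete sketch of nested if/else inside a function. The helper describe.sign is hypothetical and exists only to show the syntax:

```r
# Describe the sign of a single number
describe.sign <- function(x) {
  if (x > 0) {
    "positive"
  } else {
    if (x < 0) {
      "negative"
    } else {
      "zero"
    }
  }
}

describe.sign(3)    # "positive"
describe.sign(-2)   # "negative"
describe.sign(0)    # "zero"

# if() expects a single condition; ifelse() is the vectorised counterpart
ifelse(c(3, -2, 0) > 0, "positive", "not positive")
```

Keeping else on the same line as the closing } matters at the command line, where R would otherwise treat the if statement as finished.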

running.mean.fast function

We’ll create a new function, running.mean.fast , that will calculate a running mean with a given window

Instead of recalculating the mean of all the observations each time we change the center of the window, we can just adjust the previous mean by dropping one observation and adding one observation

The stickiest part (again) will be the handling of the beginning and end of the vector

It helps if we divide it into three cases: beginning, middle, end

Dividing into cases

Beginning: Add observations to the total sum and add one to the denominator

Middle: Add one observation to the end, subtract one from the beginning, and the denominator stays constant

End: Subtract observations from the total sum and subtract one from the denominator.

running.mean.fast<-function( datavector, windowsize=3 ){
    resultvector<-rep( 0, length(datavector) )
    k <- round((windowsize-1)/2)
    #get index corresponding to windowsize
    #Set initial sum
    #for(i in 1:length(datavector)){
    #    if(beginning) { do beginning calcs}
    #    else {
    #        if(middle) { do middle calcs }
    #        else { do end calcs }
    #    }
    #}
    resultvector
}

currentsum <- sum(datavector[1:k])
for(i in 1:length(datavector)){
    #Beginning
    if( i <= (k+1) ) {
        currentsum <- currentsum+datavector[i+k]
        resultvector[i] <- currentsum/(i+k)
    } else {
        #Middle
        if((i>(k+1))&(i<=(length(datavector)-k))){
            currentsum <- currentsum+datavector[i+k]-datavector[i-k-1]
            resultvector[i] <- currentsum/(2*k+1)
        } else {
            #End
            currentsum <- currentsum-datavector[i-(k+1)]
            resultvector[i] <- currentsum/(length(datavector)-i+1+k)
        }
    }
}
resultvector
}
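To test running.mean.fast on a small dataset with a known answer, it helps to have a slow but obvious reference that recomputes every window mean from scratch. The sketch below is such a reference; the name running.mean.naive is made up, and it assumes the window is simply clipped at both ends of the vector, matching the three cases described above.

```r
# Reference implementation: recompute each window mean from scratch
running.mean.naive <- function(datavector, windowsize = 3) {
  n <- length(datavector)
  k <- round((windowsize - 1) / 2)
  sapply(1:n, function(i) {
    window <- max(1, i - k):min(n, i + k)   # clip the window at the ends
    mean(datavector[window])
  })
}

running.mean.naive(c(1, 2, 3, 4, 5))   # 1.5 2.0 3.0 4.0 4.5
```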

Data frames

A data frame is much like a matrix, only more flexible

Allows the user to name columns, which makes operations, such as using the regression functions lm and glm , easier

User can also attach data frames, which gives even easier access to information

The easiest way to read tabular data into R is through read.table , which returns a data frame
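A minimal sketch of read.table, using a small made-up table written to a temporary file so that the example is self-contained (the column names and values are invented):

```r
# Write a small, made-up table to a temporary file
tab.file <- tempfile(fileext = ".txt")
writeLines(c("height weight",
             "150 52",
             "165 61",
             "180 77"), tab.file)

# header = TRUE tells read.table that the first line holds column names
body.data <- read.table(tab.file, header = TRUE)

is.data.frame(body.data)   # TRUE
names(body.data)           # "height" "weight"
mean(body.data$height)     # 165
```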

Example data frame: Ships data

Description from R help file:

Data frame giving the number of damage incidents and aggregate months of service by ship type, year of construction, and period of operation.

Using commands

> library(MASS)

> data(ships)

makes the frame available

Example data frame: Ships data

To get the names of the variables in the ships frame, use the names function

> names(ships)

[1] "type" "year" "period" "service" "incidents"

Typing ships$year accesses just the year variable

> ships$year

[1] 60 60 65 65 70 70 75 75 60 60 65 65 ....

[26] 60 65 65 70 70 75 75 60 60 65 65 70 ....

Example data frame: ships data

The variables themselves act like vectors and the data frames act like a matrix

> ships$service[25:30]

[1] 251 105 288 192 349 1208

> ships[5,]

  type year period service incidents

5 A 70 60 1512 6

> ships[5,4]

[1] 1512

> ships[25:30,1:3]

type year period

25 D 60 60

26 D 60 75

27 D 65 60

28 D 65 75

29 D 70 60

30 D 70 75

Lists are different from data frames in that the different parts of a list can have different numbers of rows, but you create data frames and lists from scratch in a similar way:

> data.frame(x1=c(45,12,13), x2=c(12))
  x1 x2
1 45 12
2 12 12
3 13 12
> list(x1=c(45,12,13), x2=c(12))
$x1
[1] 45 12 13

$x2
[1] 12

Lists and data frames

Notice how the data frame and the list have the same names for the variables (or parts), but the list allows for differing dimensions of the two variables (or parts)

Data frames are better used for data (duh) and lists are better used for complicated data structures returned by functions (e.g. lm returns a list that contains residuals, coefficients, etc., all of differing dimension)

Lists and data frames

> names(lm(incidents~year, data=ships))

[1] "coefficients" "residuals" "effects" "rank"

[5] "fitted.values" "assign" "qr" "df.residual"

[9] "xlevels" "call" "terms" "model"

> is.list(lm(incidents~year,data=ships))

[1] TRUE

> is.data.frame(lm(incidents~year,data=ships))

[1] FALSE
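The parts of a fitted model can be pulled out of the list with $, just as with data frame columns. The sketch below uses the cars data frame that ships with R instead of ships, so that it runs without loading MASS:

```r
# cars is a small built-in data frame (speed and stopping distance)
fit <- lm(dist ~ speed, data = cars)

# Components of the returned list have different lengths
length(fit$coefficients)   # 2: intercept and slope
length(fit$residuals)      # 50: one per observation

# Extractor functions return the same pieces
all.equal(coef(fit), fit$coefficients)   # TRUE
```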

Plotting can be done at two levels

The first level is simple to use, but difficult to configure

The second level is very difficult to use, but allows you to control just about everything

Try to avoid throwing your keyboard on the floor while struggling to get used to the syntax

Basic plot command

Typing plot( object ) will produce a variety of results, depending on what you give the plot command

If you give it a vector, it plots each value vs. its vector index

If you give it a data frame, it defaults to the pairs command, which prints out all bivariate scatterplots of all pairs of variables

If you give it a linear model object, it will output four plots (residual vs. fitted, q-q plot, fitted vs. std. residuals, and Cook’s distance)

Examples of scatterplots/function plots

x<-runif(1000,-3,3)

y<- x + 3 + rnorm(1000,0,1)

plot(x,y)

[Figure: scatterplot of y against x]

Example 2

plot(x, y, pch = '.')

[Figure: the same scatterplot drawn with '.' as the plotting symbol]

Example 3

w<-seq(-3,3,length=1000)

plot(w, dnorm(w), type='l')

[Figure: line plot of dnorm(w) against w]

Titles, line types, labels, axes, etc.

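> plot(w,dnorm(w), main='Your title goes here', type='l', lty=2)

[Figure: dashed line plot of dnorm(w) against w, titled 'Your title goes here']

> plot(w, dnorm(w), type='l', main='Title', xlab='X-axis label',
       ylab='Y-axis label')

[Figure: line plot of dnorm(w) against w with x-axis label 'X-axis label']

> plot(w, dnorm(w), type="l", main="Title",
       ylim=c(0,.3), xlim=c(-3,3))

[Figure: line plot of dnorm(w) against w with the y-axis limited to the range 0 to 0.3]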
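The "second level" of plotting builds a figure piece by piece. The sketch below is one possible way to do that with the base graphics functions lines, abline and legend; the second density curve and the legend labels are illustrative additions.

```r
w <- seq(-3, 3, length = 1000)

# Start with an empty frame, then add each element separately
plot(w, dnorm(w), type = "n", main = "Title",
     xlab = "X-axis label", ylab = "Y-axis label")
lines(w, dnorm(w), lty = 1)            # standard normal density
lines(w, dnorm(w, sd = 2), lty = 2)    # wider density for comparison
abline(v = 0, col = "gray")            # vertical reference line
legend("topright", legend = c("sd = 1", "sd = 2"), lty = c(1, 2))
```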

Histograms

> histvec<-rpois(1000,5)

> hist(histvec, main="hist(histvec)")

> hist(histvec, main="hist(histvec, nclass=10)", nclass=10)

[Figure: two histograms of histvec, titled "hist(histvec)" and "hist(histvec, nclass=10)"]

Arrays

Arrays are data structures that are best used for more than 2-dimensional data, for instance, arrays of matrices

Array functions like apply will often work on matrices and data frames

For instance, if we use the attitude dataset (which ships with R in the datasets package), then

> library(MASS)

> data(attitude)

> apply(attitude, 2, mean)

rating complaints privileges learning

64.63333 66.60000 53.13333 56.36667

raises critical advance

64.63333 74.76667 42.93333

Why use apply ?

apply should perform the function more efficiently (although the speed-up is probably not as large as you would expect)

It also makes for cleaner code by cutting down on looping

> rdmmat<-matrix(rnorm(10000,4*(1:10)),10,1000)

> apply(rdmmat,1,mean)

[1] 4.003565 7.993293 11.983947 16.028533 19.939690 23.939630 27.981554

[8] 32.018374 36.047277 39.966829

> apply(rdmmat,1,var)

[1] 0.9376154 0.9554272 1.0340050 0.9289830 0.9847559 1.0051780 0.9676526

[8] 1.0440103 1.0359232 1.0346870
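For the common case of row or column means, base R also provides rowMeans and colMeans, which should give the same answer as apply with mean. A brief sketch, with an arbitrary seed so the example is reproducible:

```r
set.seed(1)   # arbitrary seed, chosen only for reproducibility
rdmmat <- matrix(rnorm(10000, 4 * (1:10)), 10, 1000)

via.apply    <- apply(rdmmat, 1, mean)
via.rowmeans <- rowMeans(rdmmat)

all.equal(via.apply, via.rowmeans)   # TRUE
```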

That’s it!

If you need more help, feel free to contact the instructor or one of the TAs.

I would like to thank Professor Steele for generously providing an extensive R tutorial that I unscrupulously massacred into this.
