R for absolute beginners: A gentle introduction

Duncan Golicher

November 17, 2008

Contents

Introduction
  What can R do for me?
  What exactly is R?
  How long will it take me to learn R?
  How much does R cost?
  How do I install R?
  Exercises and activities

Getting started in R. Finding help and extending R with packages
  What do I do now?
  How do I work with R?
  How do I get help?
  How do I use the R-help list?
  How do I extend R by installing packages?
  Exercises and activities

Vectors: Working with one variable at a time
  How can I use R as a calculator?
  How do I assign values to a variable?
  Where do the numbers go?
  How do I transform a vector?
  How do I generate a sequence of numbers in R?

Simple statistics in R: A t-test demonstration
  How do I run statistical analyses in R?
  How do I simulate some data from a known population?
  How do I test for normality?
  How do I run a t-test in R?
  How can I break down the calculations of a mean step by step?
  How do I calculate a sample standard deviation?
  How do I calculate a standard error?
  What does the t-distribution look like and how do I use it to test a null hypothesis?

Introduction

Don’t do Sudokus, use R

This document is very loosely based on material that I have taught to master’s students in Spanish since 2004. I have translated it and brought it up to date in the hope that it might be useful as an introduction to R for a general audience of postgraduate students and researchers. I am an ecologist, not a statistician, and have no formal training in mathematics. I feel this is probably an advantage when teaching introductory material. It allows me to appreciate the difficulties many have with R and to draw on my own experiences in order to help. My first contact with R was far from encouraging. A statistical colleague mentioned R in an email, claiming it to be the most useful piece of software around. I downloaded and installed R version 1.7. The icon on my desktop was removed by Windows after about six months due to lack of use. I simply couldn’t see what to do with it. Eventually I did find some time to experiment. I followed the precise but rather terse introductory material available at the time. This was hard, but fortunately previous experience with computer languages helped me to get started. I began to use R and quickly found that my colleague had been right all along. Using R instead of a typical combination of Excel and SPSS does increase the productivity of non-statisticians. I was amazed to find that a single line of R could do things that were very difficult to achieve in other ways. I quickly became a convert and wanted to spread the word.

My subsequent experiences have been both encouraging and frustrating. I have aimed at helping students and researchers to follow my path and adopt R as their preferred platform for data analysis. Overall I have been pleasantly surprised by the high number of successes. I am often contacted for advice by those I taught several years ago as they begin to use R for serious research. At the same time I have been frustrated by the difficulty in persuading more people to adopt and use R on a routine basis. I have noticed two quite distinct classes of recalcitrants. Students who have never previously seen a command line are quite understandably put off at the start. The R style of working does not offer immediate attractions. R has a very steep learning curve. With some gentle encouragement this barrier can often be overcome, but it is always a struggle. The second class of potential R users is much more difficult to persuade. They are experienced researchers who have invested considerable effort in learning how to use alternative statistics packages such as SPSS or SAS. For a researcher, time is the resource in shortest supply. Learning a new, superficially more complicated, way to analyse data appears to be a luxury they simply can’t afford. This is also understandable. However it is regrettable. This is the class of users that could potentially benefit most from a working knowledge of R. While students can be forced to use R by giving them evaluated course work, there is little that can be used to encourage researchers apart from demonstrating results that they would like to produce themselves.

This course aims to help absolute beginners of both types to move up the initially steep gradient of the learning curve and begin to enjoy using R. Once the hard barriers are overcome, potential users should find that developing skills in R is a satisfying experience in its own right. At the same time they will find that knowledge of R can result in much greater scientific productivity. I have aimed for an informal style as much as possible in order to engage with the reader and help to soften the impact. I have also adopted a question and answer format for most of the document. At each stage I will try to identify the key “FAQs” and provide my own answers to them.

What can R do for me?

An example of potential productivity gains is provided by this document itself. I have written the material using Sweave in LaTeX (LyX). This is a great combination of tools for scientific writing. It allows me to embed R code directly within the text. This code is run when the document is compiled. The output is thus incorporated into quite a professional looking pdf document with no effort. Once the document is set up I don’t need to think about the formatting or typesetting. I can simply work on the content. This detail alone shows the immense power of R and its associated tools. The potential for productivity gains once one has learnt how to use them is endless.
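To make the idea of embedding code in text a little more concrete, here is a minimal sketch of what happens behind the scenes. The file names and the contents of the tiny document are invented for illustration; the only real function assumed is Sweave() from the utils package that ships with R. The sketch writes a very small source document containing one chunk of R code, then asks Sweave to run the code and weave the result into a LaTeX file.

```r
# Write a tiny source document (hypothetical content) with one R code chunk.
rnw_file <- tempfile(fileext = ".Rnw")
tex_file <- tempfile(fileext = ".tex")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Two plus two is calculated when the document is compiled:",
  "<<>>=",
  "2 + 2",
  "@",
  "\\end{document}"
), rnw_file)

# Sweave() runs the chunk and writes a LaTeX file containing code and output.
Sweave(rnw_file, output = tex_file, quiet = TRUE)

# Look at the woven LaTeX source
woven <- readLines(tex_file)
woven
```

The resulting .tex file would still need to be compiled with LaTeX to produce a pdf, a step that tools such as LyX can take care of.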

I will be explaining what “R code” is later on. However here is an example of a simple line of code that produces output.

library(fortunes)
fortune("pizza")

Roger D. Peng: I don't think anyone actually believes that R is designed to
make *everyone* happy. For me, R does about 99% of the things I need to do, but
sadly, when I need to order a pizza, I still have to pick up the telephone.
Douglas Bates: There are several chains of pizzerias in the U.S. that provide
for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the
Internet modules in R, it's only a matter of time before you will have a
pizza-ordering function available.
Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface
under R for Windows, but Guido forgot to include it) provides one (for use in
Sydney, Australia, we presume as that is where the GraphApp author hails from).
Alternatively, a Padovian has no need of ordering pizzas with both home and
neighbourhood restaurants ....
   -- Roger D. Peng, Douglas Bates and Brian D. Ripley
      R-help (June 2004)

Perhaps pizza lovers outside Australia will be disappointed. However this exchange on the R-help list makes a very serious point. I frequently use R to bring in data from the Internet and analyse it. R is often used in finance for producing automated reports on the state of the stock market. Geneticists can use R to consult huge data banks. R is very well integrated within the contemporary “cloud” style of scientific computing.

As an example of the unusual ways that R can be used, the function fetchSudokuUK fetches the daily sudoku puzzle from

http://www.sudoku.org.uk/

library(sudoku)
puz <- fetchSudokuUK()
puz

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    0    0    0    1    3    0    5    0
 [2,]    1    5    0    9    4    2    3    0    6
 [3,]    0    0    0    0    0    0    0    0    0
 [4,]    0    2    0    0    0    6    0    3    0
 [5,]    0    0    0    0    9    0    0    0    0
 [6,]    0    7    0    2    0    0    0    8    0
 [7,]    0    0    0    0    0    0    0    0    0
 [8,]    8    0    3    4    7    1    0    2    9
 [9,]    0    9    0    8    3    0    0    0    0

As Greg Snow, the author of the package, notes: “Don’t submit your solution for the prize contest if you used solveSudoku or playSudoku with solve=TRUE”. That would be cheating.

solveSudoku(puz)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    6    4    9    7    1    3    2    5    8
 [2,]    1    5    8    9    4    2    3    7    6
 [3,]    2    3    7    5    6    8    1    9    4
 [4,]    9    2    4    1    8    6    7    3    5
 [5,]    5    8    1    3    9    7    4    6    2
 [6,]    3    7    6    2    5    4    9    8    1
 [7,]    7    1    5    6    2    9    8    4    3
 [8,]    8    6    3    4    7    1    5    2    9
 [9,]    4    9    2    8    3    5    6    1    7

Solving Sudokus may seem trivial, but it certainly shows the power of the R language to save at least someone’s time. It is now very difficult to identify forms of serious scientific computing that haven’t been implemented in R. However there is an important caveat that is very neatly summed up in this exchange.

fortune("Yoda")

Evelyn Hall: I would like to know how (if) I can extract some of the
information from the summary of my nlme.
Simon Blomberg: This is R. There is no if. Only how.
   -- Evelyn Hall and Simon 'Yoda' Blomberg
      R-help (April 2005)

In other words, although you can do almost anything in R, it is often far from obvious how. The main aim of this course is to help you get to the stage where you can begin to find ways to solve your own problems using R.

Some very advanced analysis can only be achieved using R. As Frank Harrell notes, R is at the cutting edge.

fortune(10)

Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities
(last year it was about 10 years behind) in my estimation.
   -- Frank Harrell (SAS User, 1969-1991)
      R-help (September 2003)

fortune(120)

Rene M. Raupp: Does anybody know any work comparing R with other (charged)
statistical software (like Minitab, SPSS, SAS)? [...] I have to show it's as
good as the others.
Kjetil Brinchmann Halvorsen: Sorry. That will be difficult. Couldn't it do to
prove it is better?
   -- Rene M. Raupp and Kjetil Brinchmann Halvorsen
      R-help (May 2005)

What exactly is R?

This is not an easy question to answer. In one sense the flexibility and power of R means that it becomes something different to every user. The conventional answer is that R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. In other words R is both a computer language and a set of procedures that have already been implemented in order to carry out specific tasks.
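A small sketch may help to show the two sides of R mentioned above. The numbers and the function name my_mean are invented for illustration. The first call uses a procedure that is already implemented in R, while the second uses the language itself to write the same calculation by hand.

```r
# Some made-up numbers to work with
x <- c(2, 4, 6, 8)

# A procedure that R already provides
mean(x)

# The same calculation written in the R language as a new (hypothetical) function
my_mean <- function(v) sum(v) / length(v)
my_mean(x)
```

Both calls should give the same answer. Most of the time a beginner only needs the first style, but the second is always available when no ready-made procedure exists.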

There was a time when most computer users had a working knowledge of computer programming. This is no longer the case. In fact, to a very close approximation, none of the students I have taught R had ever written a line at a command prompt before. This is not surprising given the universality of the point and click, menu driven interfaces that have made computers universally accessible. What is more surprising to the students who have learned R is the discovery that many statistical tasks are actually easier to perform using a command than by using a menu. Doing statistics and working with tables of numerical data is not the same sort of task as word or image processing. There are many instances where menu driven approaches simply get in the way and cause confusion. This will become clearer as we go on. At this point the simple message is that R is a computer language, but you certainly do not need to be a computer programmer to use it. However you might need to develop some programming skills to finally get the most from it.

fortune(52)

Can one be a good data analyst without being a half-good programmer? The short
answer to that is, 'No.' The long answer to that is, 'No.'
   -- Frank Harrell
      1999 S-PLUS User Conference, New Orleans (October 1999)

How long will it take me to learn R?

The rest of your life. There is far too much in R to learn in any less time than that. However getting up to speed in R probably takes a couple of months, assuming that you are prepared to put in fifteen to twenty minutes a day of practice with simple exercises. If you regard learning R as an enjoyable mental challenge that will also make you more productive, this is not unreasonable. Having an experienced R user to hand will help to prevent time being wasted as a result of misunderstandings.

fortune("learning curve")

The learning curve is steep - but then like many people, I'd like to be able to
do sophisticated modelling with deep understanding and no effort :-)
   -- Sean O'Riordain (in a thread about the helpfulness of documentation)
      R-help (July 2005)

How much does R cost?

Nothing. R is Open Source software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. A copy of the GNU General Public License is available via WWW at

http://www.gnu.org/copyleft/gpl.html

The fact that R is Open Source does not mean that you are likely to actually make any changes to the source code yourself. However the fact that others can read and modify the code gives R an important edge for scientific users. It means that no-one has to reinvent wheels. Once a statistical procedure is written for R and found to work correctly it can be reused and modified as necessary. Also, despite the rather worrying cautionary notice that “R is free software and comes with ABSOLUTELY NO WARRANTY”, all the most important code in R has been closely scrutinised by the finest academic statisticians in the business. R is reliable and accepted as such by scientific publishers. An R user is in the privileged position of standing on the shoulders of giants.

fortune("dodgy software")

Mingzhai Sun: When you use it [R], since it is written by so many authors, how
do you know that the results are trustable?
Bill Venables: The R engine [...] is pretty well uniformly excellent code but
you have to take my word for that. Actually, you don't. The whole engine is
open source so, if you wish, you can check every line of it. If people were out
to push dodgy software, this is not the way they'd go about it.
   -- Mingzhai Sun and Bill Venables
      R-help (January 2004)

How do I install R?

R can be run under Windows, Linux, Mac or Solaris and probably most other platforms. A quote from Barry Rowlingson that still has not made it into the fortunes package: “I’d like to see a Nintendo Wii port, just so I can play Super Mario Generalised Linear Modelling by waving the controller around.”

At the time of writing R for Windows is available by clicking on the link below.

http://cran.r-project.org/bin/windows/base/R-2.8.0-win32.exe

Versions change over time. You should ensure that you install the latest version (for example R-2.9.0-win32.exe when it becomes available). Linux users now find versions of R in the standard repositories of most popular distributions. In my own case I prefer to run R on Ubuntu. This Linux distribution is laptop friendly and handles most dependency issues very cleanly. Packages can be installed using Synaptic or aptitude. As a minimum r-base, r-base-dev, r-base-core, r-base-html and r-base-latex are needed. A search for r-cran in the Synaptic Package Manager will show a large number of extension packages. It is probably worth installing all of them right from the start in order to save time when they are found to be needed. Linux users can also install RKWard, which provides a sophisticated graphical interface for R. RKWard also includes a large number of scripts for routine statistical analysis. Together they form a user friendly free alternative to SPSS.

Windows users can obtain a graphical interface from the sciviews project.

http://www.sciviews.org/

Even if the full GUI is not used I recommend the program Tinn-R for editing R scripts with syntax highlighting if you work under Windows.

http://www.sciviews.org/Tinn-R/index.html

That said, I will not be using any graphical interface to R in this course. All the material will be generic, cross platform and robust to any future changes in versions of R. I will explain the dynamic involved in using the material in the next section, where I will also explain the general procedure for extending R using packages, which works for users on any platform.

Exercises and activities

1. Install the latest version of R on your computer.

2. Investigate the contents of CRAN (http://cran.r-project.org/)

3. Download all the relevant manuals and courses from the “Contributed” section of CRAN.

4. Browse the R graphics gallery (http://addictedtor.free.fr/graphiques/)

Getting started in R. Finding help and extending R with packages

Just to get started I will assume that you are running R in Windows. Mac users have a similar experience. Under Linux, R is run by typing R in the terminal. By default in Linux no GUI is produced. I will assume that Linux users are already hardened to this style of working. On starting R in Windows you are presented with the superficially unhelpful looking interface shown below.

What do I do now?

This is always where the panic starts. The Windows GUI version at least gives you a few things to click on (the console version in Linux just has a cursor), but they don’t seem to do very much. There are no immediate indications that R can do any statistics. The experience for many is quite offputting.

The first thing to point out is that this interface is in fact more helpful than it looks at first sight. But let’s return to that later. First, just to break the ice, let’s make R do something. Anything.

At this point I will introduce the convention that will be followed throughout this course. All lines that appear in the format below can (and should) be either typed directly into the console or copied and pasted from this document.

demo(graphics)

They will then “run”. So if you type demo(graphics) things will start happening. In this case R runs through a number of scripts that give some examples of the sort of graphical output it is capable of. This particular demo is now quite old and doesn’t really do full justice to R’s graphical potential. As the script runs you will be prompted to press return to get more output. I have included the first graph of several such demos below.
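If you are curious about which demos exist before running one, the sketch below lists those that come with the graphics package. It assumes only the standard demo() function; the variable name graphics_demos is my own invention for illustration.

```r
# List the names of the demos shipped with the graphics package
graphics_demos <- demo(package = "graphics")$results[, "Item"]
graphics_demos
```

Any name in the resulting list can then be passed to demo(), for example demo(image).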

demo(graphics)

[Figure: first plot from demo(graphics), titled “Simple Use of Color In a Plot”, with the label “Just a Whisper of a Label”]

demo(image)

[Figure: first plot from demo(image), titled “Maunga Whau Volcano”, with axes x and y, drawn with col=terrain.colors(100)]

demo(persp)

[Figure: first plot from demo(persp), a perspective surface with axes x, y and z, titled z = Sinc(sqrt(x^2 + y^2))]

demo(plotmath)

[Figure: first page of demo(plotmath), a table pairing plotmath expressions with their typeset output, in panels titled Arithmetic Operators, Sub/Superscripts, Juxtaposition, Lists, Radicals, Relations and Typeface]

How do I work with R?

Fear of the command line seems to be the biggest barrier to using R. At the same time adopting a script based approach to data analysis is the greatest advantage of R. So it is worth taking some time at this stage to explain carefully how to work with the R console. Why is the R interface so minimalist? When you realise what R can do and think carefully the reason becomes obvious. A menu based GUI for a statistics program is simply a way to trace a path to a function and prompt the user for inputs to that function. So, in SPSS or Excel, you might drill through a couple of layers of menu in order to find the function to produce a boxplot of a vector of data called, say, treediams. In R you write boxplot(treediams). Once you have learnt basic commands this is an efficient way of working. Statistical analysis is a vast subject. How many different function calls do you need in order to be able to apply any of the multitude of statistical methods that are available in R? The answer is unknown, but it must easily be in the tens of thousands. A menu based GUI that provided access to all of them would be huge. It would no doubt look much more threatening to new users than the command line. Even if it were to be built, few would use it. In many cases it would still be quicker to type commands once they are known.
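As a minimal sketch of the boxplot example, the lines below first create a vector called treediams and then plot it. The diameters are made-up values for illustration only; boxplot() and fivenum() are standard R functions.

```r
# Hypothetical tree diameters in cm (made-up values for illustration only)
treediams <- c(12.1, 15.4, 9.8, 22.3, 18.7, 14.2, 30.5, 11.9)

# One command draws the boxplot
boxplot(treediams, ylab = "Diameter (cm)")

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(treediams)
```

The single call to boxplot() replaces the trip through the menus, and the call to fivenum() shows the kind of numbers that the plot summarises.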

Speed isn’t the only advantage of using written commands to run analyses. The biggest advantage is that every step of the analysis is documented. You can collect the steps you took to produce a figure or table together into a script and reproduce the results exactly. So the typical method of working is to open R and also open a text editor like Notepad. I recommend Tinn-R for Windows users as it has built in syntax highlighting. I usually experiment with a command in the console first. When I find it does what I intended I copy and paste it to the script I am building. At the end of a session using R I have a complete record of what I have been doing.
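The sketch below gives a rough idea of how a saved script reproduces results. The script contents, the data and the variable names are all invented for illustration. To keep the example self-contained the script is written to a temporary file from within R; in practice you would type the lines into your text editor and save the file yourself.

```r
# Write a tiny, hypothetical analysis script to a temporary file
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "heights <- c(1.62, 1.75, 1.80, 1.68)  # made-up data",
  "mean_height <- mean(heights)"
), script_file)

# source() runs every line of the script in order, reproducing the results
source(script_file)
mean_height
```

Running source() on the same file tomorrow, or on a colleague's computer, should give exactly the same result.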

How do I get help?

So, how do you know what commands are available? There are two complementary ways. The first is to follow a book or course like this one that introduces you to commands in a logical sequence. The other is to use the comprehensive R help system. The R help system will not teach you any statistics nor will it explain why you might want to run a function. However it will show you how to run almost all the functions in R and also provide an example of their use. If it takes a little effort to find out how to run a function users might be encouraged to spend more time finding out why and whether they need it.

fortune(51)

The documentation level of R is already much higher than average for open
source software and even than some commercial packages (esp. SPSS is notorious
for its attitude of "You want to do one of these things. If you don't
understand what the output means, click help and we'll pop up five lines of
mumbo-jumbo that you're not going to understand either.")
   -- Peter Dalgaard
      R-help (April 2002)

You can often simply substitute the data used in the example for your own to get results. You can open a web browser interface to the help system from the console.

This will take you to the page shown below. You can also do the same by writing a command to call a function, as that is all the GUI really does.

help.start()

The most used links on this page are Packages and Search Engine and Keywords. To use the search engine you need Java installed. Try searching for histogram.

This will then show a number of links to functions associated with the production of histograms. Try looking at the function “hist”. You will find a standard page layout which is the same for all functions, including sections labelled Description, Usage, Arguments, Details, Value, References, See Also and Examples. Probably the most important section at this stage is Examples. This provides you with a template for using the function. All the examples can be run in the console either by copying and pasting the code on the help page or by typing example("thefunctionyouwant"). If you have a good idea of the name of the function, the help page will be shown by simply typing ?hist or help("hist").

help("hist")
example("hist")

[Figure: histograms produced by example("hist"), titled “Histogram of islands” (x axis islands, y axis Frequency) and “Histogram of sqrt(islands)” (x axis sqrt(islands), y axis Frequency in one panel and Density in another)]
How do I use the R-help list?

If you have real problems with R you can get direct help from the best in the business. These are the programmers, developers and long-time R users on the R-help list. To subscribe or unsubscribe visit https://stat.ethz.ch/mailman/listinfo/r-help or, via email, send a message with subject or body ’help’ to [email protected]. However you should think carefully before using this fantastic resource directly. All the previous answers to questions are held online and are found easily by the usual search engines. Trivial questions that have already been answered are rarely tolerated. R developers are extremely busy people who have neither time nor inclination to help with “homework”. The posting guide states:

1. Do help.search("keyword") and apropos("keyword") with different keywords (type this at the R prompt).

2. Do RSiteSearch("keyword") with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches.

3. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt)

4. If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it.

5. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html)

6. Read at least the relevant section in An Introduction to R

7. If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting

8. It helps to provide a small example that someone can actually run.
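The first three steps of the posting guide can be tried straight away at the R prompt. The sketch below is only an illustration, using hist as an arbitrary example of a function name. apropos() and args() are standard functions; the help.search() and RSiteSearch() calls are left as comments because the first can be slow and the second needs an Internet connection and a web browser.

```r
# Step 1: list objects whose names contain "hist" in the attached packages
matches <- apropos("hist")
matches

# Step 1 (continued): search the installed help pages by keyword
# help.search("histogram")

# Step 2: search online resources (opens a web browser)
# RSiteSearch("histogram")

# Step 3: check the arguments of a function; ?hist opens the full help page
args(hist)
```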

fortune("demigod")

You may have not been long enough on this list to see that some of the old-time
gurus have reached a demigod like status. Demigods have all rights to be 'rude'
(that's almost a definition of a demi-deity).
   -- Jari Oksanen (in a discussion on whether answers on R-help should be more
      polite)
      R-help (December 2004)

How do I extend R by installing packages?

A fundamental concept of R is the idea of packages. The initial installation of R provides a base of functions, most of which have been developed and maintained by a small core team of programmers. However, R is capable of carrying out a huge number of additional analytical techniques. These are often written in R itself. Fortran or C code can also be linked into R and run as R commands. It is this extensibility that has led to R becoming the lingua franca of statistical computing. The biggest challenge is keeping up with the vast number of packages and being aware of what is available. It is safe to assume that someone has implemented almost any standard technique you might need in R.

The list of packages with a short description of what they do can be found on CRAN. CRAN means “Comprehensive R Archive Network” and is mirrored throughout the world. Because the list of packages is now so vast, a “Task View” section has been set up that helps users to find packages associated with specific types of work. For example the “Spatial” section would be a first stop if you are interested in using R for processing geographical information. “Environmetrics” shows some of the most useful packages for ecologists and resource managers.

The important element to remember with regard to packages is the difference between installing a package and making it available for use during an R session. When a package is installed it is downloaded to your hard disk and can be used. This needs to be done only once, with the exception of updating packages as new versions become available (see footnote 1).

Packages can be installed under Windows from the graphical interface by choosing “Install package(s)” under the Packages menu. Again the job can also be done through a command. This is my preferred way of installing packages. The following line will install the package vegan, a key tool for multivariate analysis in ecology, and vcd for visualising categorical data. The addition of dep=T tells R to install all other packages upon which these packages depend (dep is matched to the full argument name dependencies, and T is shorthand for TRUE).

install.packages(c("vegan", "vcd"), dep = T)

The notion of dependencies is well known to those who use open source software.

There are some key points about R to mention at this point. First, R is case sensitive. The line below will not work.

Install.packages(c("vegan", "vcd"), dep = T)

The next point is that you only need to install a package to the hard disk once. However, you must load it into memory every time you need to use a function from the package. This is achieved using the command library. For example

library(vegan)

Makes the vegan package available for use. This will become clearer over time.
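The distinction between installing once and loading in every session can be sketched with a small helper. The function load_or_install is hypothetical (it is not part of R); it relies only on requireNamespace, install.packages and library, and it is demonstrated here with the stats package, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing from the hard disk, then attach it for the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)  # one-off download to disk
  }
  library(pkg, character.only = TRUE)           # needed in every session
  invisible(pkg %in% .packages())               # TRUE once the package is attached
}

ok <- load_or_install("stats")   # "stats" ships with R, so nothing is downloaded
ok
# load_or_install("vegan")       # would download vegan the first time only
```

Placing a call like this at the top of a script means the script runs both on a machine where the package is already installed and on one where it is not.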

Exercises and activities

1. Make a list of the packages on CRAN that are potentially useful for multivariate analysis.

2. Run examples of canonical correspondence analysis and non metric multidimensional scaling using the package vegan. (Note, you do not necessarily have to understand the analysis at this stage, the exercise is aimed at practice in using the help system and examples)

3. Install the packages nortest and moments. What do they do and how might they be useful?

4. Run an example of a test for normality

Vectors: Working with one variable at a time

The first goal when you begin working with R is to become sufficiently comfortable with the underlying concepts of the R language to be able to manipulate data easily. This ability does not come overnight. You will need to practice with a lot of examples. At first it may seem difficult to achieve simple results. The payback is that with experience it will become simple to achieve difficult results.

Footnote 1: Linux users will already be familiar with this concept. Debian users also have the advantage that packages installed with aptitude will be automatically updated.


How can I use R as a calculator?

R can be used as a scientific calculator. Any operation written in the console will be evaluated and the result returned to the console.

1 + 1

[1] 2

More complex operations follow the typical operator order. Be careful to use brackets correctly. An extra bracket doesn’t do any harm, but leaving one out may give results you don’t expect.

1 + 1 * 3

[1] 4

(1 + 1) * 3

[1] 6

3 * 100/10 + 5

[1] 35

3 * 100/(10 + 5)

[1] 20

(3 * 100)/(10 + 5)

[1] 20

10 * (3 - 1)

[1] 20

10 * (3 - 1)^2

[1] 40

10 * 3 - 1^2

[1] 29
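A few further cases of operator order are worth trying, as they catch many beginners out. The lines below are an extra illustration rather than part of the original set of examples; the integer division and remainder operators are standard R.

```r
-2^2       # the power is applied before the minus sign, giving -4
(-2)^2     # brackets force the minus sign first, giving 4
7 %/% 2    # integer division: 3
7 %% 2     # remainder after division: 1
```

When in doubt about the order, add brackets; they cost nothing and make the intention explicit.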


How do I assign values to a variable?

You may have noticed that the file menu in the Windows R console does not provide an obvious way of getting data into R. On the introductory R help page there is a link to a document called R Data Import/Export. This is a comprehensive and useful document for experienced R users written by Brian Ripley, a well known R Guru.

fortune(47)

Seldom are prizes, credit, and gratitude given, else Brian would be drowning in them.

-- Anthony Rossini (about the merits of implementing software)
R-help (May 2004)

However I do not recommend it to beginners. This course will contain a whole section devoted to importing and exporting data from other statistical packages such as SPSS and from spreadsheets and databases. For the time being we will enter data “by hand” in the console. If you type

x <- scan()

You can enter numbers one by one until you press “enter” twice in a row to exit.

A more reproducible way of assigning numbers to a vector is by concatenating. This can be included in a script.

x <- c(1, 3, 6, 7, 9, 10, 12, 23)
x

[1] 1 3 6 7 9 10 12 23

A vector is simply a list of numbers in a single dimension. R will refer to all the numbers by the name x and operate on them. There are various points to mention here. First of all the <- symbol. In an informal sense it means “take everything that is at the tail of the arrow and put it into the object at the head”. So x<-c(1,2,3) gives x the values of 1, 2 and 3. Usually the arrow points to the left. It is perfectly valid to turn it around, but this would usually be confusing and is not done. You can also use the equals sign to do the same job.

x = c(1, 3, 4, 6, 7, 12, 23)
x

[1] 1 3 4 6 7 12 23

The use of “=” as an assignment operator is common in many computer languages, but I much prefer the arrow syntax as it avoids confusion.

To copy the contents of x to y is simple. Note that x will still contain the same numbers.

y <- x
x

[1] 1 3 4 6 7 12 23

Secondly be aware that you must use the concatenation operator c() to form a vector. None of these lines will work.

x<-1,2,3,5
x<-(1,3,4,6,7)
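For comparison, the two failing lines can be written correctly with c(). The short sketch below also shows the reversed arrow and the fact that assignment makes a copy; the object names w and y are arbitrary choices for the illustration.

```r
x <- c(1, 2, 3, 5)        # c() concatenates the values into a vector
c(1, 3, 4, 6, 7) -> w     # right-pointing arrow: valid, but rarely used
y <- x                    # y receives a copy of x
x[1] <- 100               # changing x afterwards ...
y                         # ... leaves y as it was: 1 2 3 5
```

Because y holds its own copy, later changes to x never alter it.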


Where do the numbers go?

It wasn’t until I had taught two courses on R and heard this question several times that I realised that those who ask it are expecting a completely non-technical answer. They don’t want to know details about the way R uses memory. The problem that arises in some students’ minds is related to the almost ubiquitous use of spreadsheets.

fortune(59)

Let's not kid ourselves: the most widely used piece of software for statistics is Excel.

-- Brian D. Ripley ('Statistical Methods Need Software: A View of Statistical Computing')
Opening lecture RSS 2002, Plymouth (September 2002)

Spreadsheet users are used to typing in numbers. The numbers remain staring at them until they move away. The notion of more abstract data objects is natural to anyone who has rudimentary contact with programming languages. However the idea is not intuitive for everyone.

This is an unexpected barrier to communication between those already used to the R way of doing things and the beginner. It needs dealing with carefully.

My explanation is that as I work on an R session I produce a collection of “objects” held in the computer’s memory. The basic properties of these objects should also be held in my own memory. I need to have a good idea of what I have put into R and why. However I really don’t want to be looking at the numbers themselves all the time. This just causes clutter and confusion. As you think more about it this seems reasonable. What is the difference in R between a vector called x containing 10 numbers and one containing 10,000? The answer is essentially nothing. The following line produces a vector with ten thousand numbers. They are then multiplied by 2. The second line is identical regardless of the size of the vector.

x <- 1:10000
x <- 2 * x

Almost anything you can do to one number you can do to 10,000 just as easily, apart from one thing: looking at them all at once. So as you work with R you must get used to not wanting to look directly at the numbers themselves. However, it is a good idea to look at properties of the numbers to make sure that everything is as it should be. This can be done through figures and statistical summaries.

A very useful function is str(). This produces a description of the structure of the data object. In this case it is a vector of numbers. The function summary produces a statistical summary of the numbers and head prints out the first six members.

str(x)

num [1:10000] 2 4 6 8 10 12 14 16 18 20 ...

summary(x)

Min. 1st Qu. Median Mean 3rd Qu. Max.
2 5002 10000 10000 15000 20000

head(x)

[1] 2 4 6 8 10 12
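A few more functions are handy for checking a large vector without printing all of it. The sketch below rebuilds the same vector so that it runs on its own; head shows six values by default, and its n argument changes that.

```r
x <- 2 * (1:10000)   # the same vector as above
length(x)            # number of elements: 10000
head(x, n = 10)      # first ten values instead of the default six
tail(x, n = 3)       # last three values
range(x)             # smallest and largest value
```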


How do I transform a vector?

Any operation will be performed on the whole vector. Try these

x <- c(1, 2, 4, 6, 10)
2 * x
x + 2
x^2
log(x)
log2(x)
log(x * 100)
exp(x)
sqrt(x)

Note that if you don’t assign the results of an operation they are simply printed out and lost. To transform x to its logarithm to the base of 10 you need to write the following lines.

x <- c(1, 6, 10, 100, 200)
x <- log10(x)
x

[1] 0.0000000 0.7781513 1.0000000 2.0000000 2.3010300

How do I generate a sequence of numbers in R?

One of the best features of R is the ease with which you can generate sequences of numbers and simulated data sets.
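As a minimal sketch of what this means in practice, the colon operator, seq and rep cover most everyday needs. The particular numbers used here are arbitrary.

```r
1:10                           # integers from 1 to 10
seq(0, 1, by = 0.25)           # fixed step: 0.00 0.25 0.50 0.75 1.00
seq(10, 100, length.out = 4)   # fixed number of values: 10 40 70 100
rep(c(1, 2), times = 3)        # repeat the whole vector: 1 2 1 2 1 2
rep(1:3, each = 2)             # repeat each element: 1 1 2 2 3 3
```

Use by when the step matters and length.out when the number of values matters.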

Simple statistics in R: A t-test demonstration

How do I run statistical analyses in R?

The R way of working is quite different to SAS or SPSS.

R can be especially useful for teaching basic statistics because it is easy to break down all the elements used in a calculation as a series of relatively easily understandable steps. Take the example of a t-test, designed to evaluate the probability that a sample of numbers could have been drawn from a population with a mean of zero.

How do I simulate some data from a known population?

It is often a good idea to simulate data from a known distribution to understand more fully the logic behind a statistical procedure. In this case we can ensure that the assumptions used and the data coincide. It is then easier to see what problems might be associated with the analysis of a real data set using the same procedure.

set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)
x

[1] -0.2529076 1.3672866 -0.6712572 4.1905616 1.6590155 -0.6409368
[7] 1.9748581 2.4766494 2.1515627 0.3892232
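The call to set.seed is what makes the “random” numbers repeatable. A short sketch of the idea follows; the object names a, b and d are arbitrary.

```r
set.seed(1)
a <- rnorm(10, mean = 1, sd = 2)
set.seed(1)                        # reset the generator to the same state
b <- rnorm(10, mean = 1, sd = 2)
identical(a, b)                    # TRUE: same seed, same numbers
d <- rnorm(10, mean = 1, sd = 2)   # no reset, so the random stream moves on
identical(a, d)                    # FALSE: a fresh set of ten numbers
```

Setting a seed at the top of a script lets someone else reproduce your simulated results exactly.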


hist(x, col = "grey")

[Figure: Histogram of x. Horizontal axis: x (−1 to 5); vertical axis: Frequency (0.0 to 3.0).]

This example should help to explain the sometimes obscure logic of classic statistical reasoning. We have just chosen ten numbers at random from a known distribution. This is a normal distribution with a standard deviation of 2 and a mean of 1. However it is a small sample, so we can never really know that from the ten numbers themselves. We try to draw inferences based on the limited knowledge these ten numbers provide.

How do I test for normality?

A better question might be, why and when should you bother testing for normality, but we will leave that aside. It is no trouble to run several normality tests in R. The best way is with the package nortest. If you have not done so you will have to install it from CRAN first. Then we can run an Anderson-Darling, Lilliefors (Kolmogorov-Smirnov) and Cramer-von Mises test.

library(nortest)
ad.test(x)

Anderson-Darling normality test

data: x
A = 0.2806, p-value = 0.5608

lillie.test(x)

Lilliefors (Kolmogorov-Smirnov) normality test

data: x
D = 0.1345, p-value = 0.8754

cvm.test(x)

Cramer-von Mises normality test

data: x
W = 0.0395, p-value = 0.6554

We can also test for skewness and kurtosis by installing another small package called moments.


library(moments)
agostino.test(x)

D'Agostino skewness test

data: x
skew = 0.2962, z = 0.3482, p-value = 0.7277
alternative hypothesis: data have a skewness

anscombe.test(x)

Anscombe-Glynn kurtosis test

data: x
kurt = 2.2753, z = -0.0702, p-value = 0.944
alternative hypothesis: kurtosis is not equal to 3
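To see what skewness and kurtosis measure, they can be worked out from central moments. The sketch below assumes the usual moment-based definitions with a divisor of n; the helper names are hypothetical, and the packaged functions may differ in detail.

```r
# Hypothetical helpers, assuming moment-based definitions with divisor n
central_moment <- function(x, k) mean((x - mean(x))^k)
skew_by_hand <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt_by_hand <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

v <- c(-2, -1, 0, 1, 2)   # small symmetric example, made up for illustration
skew_by_hand(v)           # 0: the values are symmetric about their mean
kurt_by_hand(v)           # 1.7, below the value of 3 expected for a normal
```

A skewness near 0 and a kurtosis near 3 are what a normal population would give; the tests above ask whether a sample departs from those values by more than chance would allow.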

Notice that testing for significant deviations from the desired normal properties in the case of small samples is not particularly useful. The null hypothesis is much less likely to be rejected when there is little power available. There is nothing wrong with our assumption that the numbers were drawn from a normal distribution. In this special case we know it to be absolutely correct because that is what we told R to do for us. However a histogram of the small sample itself does not look particularly normal. In fact if we repeat the process 36 times we can see that the histogram of a sample of ten numbers very rarely looks normal even when they are drawn from a normal population.

par(mfcol = c(6, 6), mar = c(0.5, 0.5, 0.5, 0.5))
replicate(36, hist(rnorm(10, 1, 2), col = "grey", xlab = "",
    ylab = "", main = "", axes = F))

This explains why testing for the normality of small samples is far less important than having a good justification for assuming that they could have been drawn from a population with normal properties. In this case if negative values are not possible the whole process would be completely meaningless. Let’s assume that they represent the differences between two paired values, which could indeed have either sign.

How do I run a t-test in R?

Now the idea of statistical hypothesis testing is to try to estimate the probability attached to various statements we could make about some underlying population from which these ten numbers were drawn. In this case we know what this population is but in real life we do not. In this case a fairly reasonable statement to test would be that the true difference between the pairs of observations is zero, in other words nothing much is happening.

In a practical research setting you run a t-test in R quickly with a single function. There are various inputs to the test, but we will use the default two-tailed option.

t.test(x)

One Sample t-test

data: x
t = 2.5612, df = 9, p-value = 0.03063
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.1476105 2.3812007
sample estimates:
mean of x
1.264406
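The printed result is only a view of an object that can be stored and picked apart. As a sketch, the sample is regenerated with the same seed so that the block runs on its own; the name res is an arbitrary choice.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)   # same sample as in the text
res <- t.test(x)                   # store the result instead of printing it
names(res)                         # components held in the result
res$statistic                      # the t value
res$parameter                      # degrees of freedom
res$p.value                        # two-tailed p-value
res$conf.int                       # 95 percent confidence interval for the mean
res$estimate                       # sample mean
```

Extracting values this way is useful when a number from the test needs to go into a later calculation or a table.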

How can I break down the calculations of a mean step by step?

So now R has told us what the result should be. Let’s do all the calculations step by step. The mean of the sample is calculated by

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

To have each step of this simple operation calculated in turn we calculate the sum in R and then divide by n

n <- length(x)
n

[1] 10

sumx <- sum(x)
sumx

[1] 12.64406

meanx <- sumx/n
meanx

[1] 1.264406

This is interesting. The mean of the population from which the numbers were drawn is 1. However the mean of the sample is some distance from this. In this case it is greater. If we took another sample the result would be different, perhaps less. It could even be below zero. For example:

mean(rnorm(10, 1, 2))

[1] 1.461282


We can even simulate this 100,000 times, look at the results as a histogram and calculate the proportion of the replicated samples that have a mean below zero. The theoretical basis of our null hypothesis test is based on this concept.

samps <- replicate(1e+05, mean(rnorm(10, 1, 2)))
hist(samps, col = "grey", breaks = 20)
sum(samps < 0)/1e+05

[1] 0.05695

[Figure: Histogram of samps. Horizontal axis: samps (−2 to 4); vertical axis: Frequency (0 to 12000).]
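The simulated proportion can be cross-checked against theory. Because the population here really is normal with mean 1 and standard deviation 2, the mean of ten values is itself normally distributed with mean 1 and standard deviation 2/√10. This is a standard result that I am assuming rather than deriving, and the object names are arbitrary.

```r
se_true <- 2 / sqrt(10)    # sd of the mean of 10 draws when sigma = 2
p_below_zero <- pnorm(0, mean = 1, sd = se_true)
p_below_zero               # probability that a sample mean falls below zero
```

The value should be close to the proportion of roughly 0.057 obtained from the simulation.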

Our two problems are that we don’t really know that the population standard deviation is exactly 2 and we don’t know what the mean is. Testing a hypothesis looks rather tricky. The best we can do is assume that the standard deviation of the sample is an estimate of the standard deviation of the population. In order to conduct a null hypothesis test we ask the question, “What is the probability of obtaining a sample with this mean, or one more extreme, if the population mean were really zero?” We do this in a slightly indirect way by calculating the t-statistic first, which has a clever built in compensation for small sample sizes.

How do I calculate a sample standard deviation?

Now, how can we calculate a standard deviation “by hand”? The formula for the sample standard deviation s, whose square is an unbiased estimator of the population variance σ², is

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (\bar{x} - x_i)^2} \]

sumsquare <- sum((x - meanx)^2)
sumsquare

[1] 21.93532

meansquare <- sumsquare/(n - 1)
meansquare

[1] 2.437258


rootmeansquare <- sqrt(meansquare)
rootmeansquare

[1] 1.561172

sdx <- rootmeansquare

Of course R has built in functions for these calculations that save all this rigmarole.

mean(x)

[1] 1.264406

sd(x)

[1] 1.561172

Notice that we’ve got the estimate for the population standard deviation wrong! It should be 2. The result of a lower estimate will be to increase the risk of “type one errors”. However this is built into the procedure. If we knew what the true standard deviation was we could use the z statistic.
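For comparison, here is a sketch of the z statistic under the assumption that the population standard deviation of 2 is known, something we can only do because the data were simulated. The sample is regenerated with the same seed so that the block runs on its own; sigma, z and p_z are arbitrary names.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)   # same sample as in the text
sigma <- 2                         # population sd, assumed known here
z <- (mean(x) - 0) / (sigma / sqrt(length(x)))
p_z <- 2 * pnorm(-abs(z))          # two-tailed p-value from the normal
z
p_z
```

The z statistic is smaller than the t statistic for this sample because the true standard deviation is larger than the one estimated from the ten numbers.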

How do I calculate a standard error?

The standard error can be calculated from the standard deviation by dividing by the square root of the sample size. It represents the variability in means that we would expect if we did as above and took many samples from the population.

\[ SE_{\bar{x}} = \frac{s}{\sqrt{n}} \]

se <- sdx/sqrt(n)
se

[1] 0.4936859
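The interpretation of the standard error as the spread of sample means can be checked by simulation. This is only a sketch: the seed and the number of replicates are arbitrary, and the comparison uses the true standard deviation of 2, which we know only because the population was simulated.

```r
set.seed(42)                                       # arbitrary seed
samps <- replicate(10000, mean(rnorm(10, 1, 2)))   # 10,000 sample means
sd(samps)                                          # spread of the simulated means
2 / sqrt(10)                                       # standard error when sigma = 2 is known
```

The two numbers should agree closely. The value of 0.494 from our single sample is lower because it was built on an underestimate of the standard deviation.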

What does the t-distribution look like and how do I use it to test a null hypothesis?

Now if under our null hypothesis the population mean is assumed to be zero (µ0 = 0) and the standard error is estimated as 0.494, we calculate a t statistic by subtracting the mean under the null hypothesis from the mean we obtained (1.264) and dividing by the standard error.

\[ t = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}} \]

t <- meanx/se
t

[1] 2.561154

R has built in functions for many statistical distributions. We have already used rnorm to generate the numbers. The t distribution follows the same naming pattern. Ten simulated values of t with 9 degrees of freedom can be generated by rt(10, df=9). To get the density use ’dt’. The cumulative distribution function is given by ’pt’ and ’qt’ gives the quantile function. We can use these functions to plot a density function for t and shade the tails that have values equal to or more extreme than the value we got. A t distribution has longer tails than a normal distribution, and this represents the compensation we are making for the fact that we have to estimate the sd from a small sample.
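A minimal sketch of the d, p and q functions for the t distribution follows. The name dof is an arbitrary choice, used to avoid clashing with the built in function df.

```r
dof <- 9                    # degrees of freedom for a sample of ten
dt(0, dof)                  # height of the density at zero
pt(2.561154, dof)           # probability of a t value below the one observed
qt(0.975, dof)              # cuts off the upper 2.5 percent: about 2.26
qnorm(0.975)                # the normal equivalent is smaller: about 1.96
pt(qt(0.975, dof), dof)     # pt undoes qt, so this returns 0.975
```

The gap between 2.26 and 1.96 is the longer tail at work: with only 9 degrees of freedom a result has to be further from zero to count as extreme.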


# Density of the t distribution with n - 1 degrees of freedom
plot(function(x) dt(df = n - 1, x), -4, 4)
# Shade the lower tail, from -4 up to -t
xvals <- seq(-4, -t, length = 50)
dvals <- dt(xvals, df = n - 1)
polygon(c(xvals, rev(xvals)), c(rep(0, 50), rev(dvals)), col = "gray")
# Shade the upper tail, from 4 down to t
xvals <- seq(4, t, length = 50)
dvals <- dt(xvals, df = n - 1)
polygon(c(xvals, rev(xvals)), c(rep(0, 50), rev(dvals)), col = "gray")
grid()

[Figure: density of the t distribution with n − 1 degrees of freedom, plotted for x from −4 to 4 (vertical axis 0.0 to 0.4), with both tails beyond ±t shaded grey.]

Now finally the null hypothesis test involves calculating the cumulative area under the two tails. We can use ’pt’ to do this.

tail1 <- pt(-t, df = 9)
tail1

[1] 0.01531455

tail2 <- 1 - pt(t, df = 9)
tail2

[1] 0.01531455

pvalue <- tail1 + tail2
pvalue

[1] 0.03062909
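Because the t distribution is symmetric, the two tails can be combined in one step, and the same ingredients give the confidence interval reported by t.test. The sketch below rebuilds everything from the seed so that it runs on its own; the names t_obs, p_two and ci are arbitrary.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)   # same sample as in the text
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_obs <- mean(x) / se              # t statistic for a null mean of zero
p_two <- 2 * pt(-abs(t_obs), df = n - 1)                 # both tails at once
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se    # 95 percent interval
p_two
ci
all.equal(p_two, t.test(x)$p.value)   # TRUE: agrees with the built in test
```

This confirms that the single call to t.test does nothing more mysterious than the steps worked through by hand.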
