
R/lessR Reference

for Basic Statistical Analysis

Less Coding for More Results in Pursuit of Statistical Best Practice

http://web.pdx.edu/~gerbing/R/R_Reference.pdf

David W. Gerbing

School of Business Administration

Portland State University

November 2, 2010

Please provide feedback, comments and critique to the author

[email protected]


R/lessR:

Less Coding for More Results in Pursuit of Statistical Best Practice

The goal of the R/lessR project is to provide the most user-friendly statistical software possible. This software should require minimal input and by default provide complete, useful output while fostering statistical best practice. This paper introduces the standard R environment, and explores the conceptual principles that underlie the design of the lessR functions and their implementation.

1. Fewer functions, more analysis. Minimize the number of needed functions for analysis and have each function provide an extensive analysis by default.

2. Foster best practice. Automatically provide the analyses needed for statistical best practice in a readily accessible format, without requiring the user to invoke additional functions and options.

3. Meaningful diagnostics. Error diagnostics should be meaningful, providing sufficient detail so that the user understands the error and how to rectify the error.

4. Polymorphism. Each function should invoke a form of polymorphism in which the function adapts to the user data and provides what the user likely wishes by default, while still providing full controls, easily invoked, for overriding any default assumptions.

5. Color. The graphical output in particular should include color by default, again allowing the user full control over the color scheme.

Before R such a project would have required a staff of programmers, financing and establishing a means of distribution. The recent revolution in statistical computing, in development since the mid-1990’s, is the open source program R (New York Times, 2009), which runs identically on Windows, Macintosh and Linux, and is available for free to anyone with an Internet connection. To use R is to write code that consists of a series of function calls. The available set of statistical functions is extensive, comparable to the best proprietary packages such as SAS. Further, the developers of R provide a readily accessible mechanism for anyone to provide additional functions to the base R project by bundling their functions into what are called contributed packages. The result, building upon the standard R functions, is that the desired conceptual principles of statistical software can be straightforwardly implemented, with an available world-wide means of distribution.

The default R installation does not provide for any of these five proposed principles of statistical software. One of the primary advantages of R, its flexibility and relatively low-level functions, can, in the context of a researcher simply interested in accomplishing a specific task, become an impediment. For example, several R functions must be successively invoked just to obtain a standard, minimal regression analysis, without additional analyses such as outlier detection. This R/lessR project attempts to do more with less, while retaining access to the full functionality of the R system by definition. This approach is particularly applicable for the introductory student. Much of the motivation for this project grew from teaching MBA students basic statistics, so many of the current lessR functions are designed to produce a specific data analysis consistent with my teaching, such as an extensive regression analysis, with a single command. In the process, my teaching responsibilities have provided an unsought, but particularly compelling motivation to produce this software, because, believe me, if an MBA student can use R and prefer that use over Excel, then the ultimate criterion of user satisfaction has been achieved.


Contents

1 An R/lessR Example
    1.1 Example Data
    1.2 Example Program

2 Accessing R
    2.1 Default Version of R
    2.2 Extend R
    2.3 Getting Help in R
    2.4 Getting Help in Other Places

3 Using R
    3.1 R input and output
    3.2 Comments
    3.3 Worksheet Apps and R
    3.4 Variables in a data frame
    3.5 Read the data into a data frame
    3.6 Assignment operator
    3.7 Combine function
    3.8 Editing an R statement on the command line
    3.9 Options and decreased reliance on scientific notation
    3.10 Case sensitivity
    3.11 Subsetting

4 Missing Values
    4.1 Assign Missing Values
    4.2 Ignore Missing Values

5 Graphics
    5.1 Labels
    5.2 Colors
    5.3 Write Graphic Directly to a pdf File
    5.4 Write Graphic Directly to a Bitmap Format
    5.5 Location of the Graphic File
    5.6 Insert PDF Image into MS Word

6 lessR Features
    6.1 Fewer functions, more analysis
    6.2 Foster best practice
    6.3 Meaningful diagnostics
    6.4 Polymorphism
    6.5 Color

7 Development Environments
    7.1 Macintosh
        7.1.1 Recommended Free Text Editor: TextWrangler
        7.1.2 Coordinating R files with TextWrangler
        7.1.3 Automatic R Code Execution from within TextWrangler
        7.1.4 Syntax Coloring
    7.2 Linux
    7.3 Windows

8 R Statistical Functions for Basic Analysis


1 An R/lessR Example

1.1 Example Data

Data are in a file called employees.csv at: http://web.pdx.edu/~gerbing/data/employees.csv

Name,Age,Gender,Dept,Salary
"Doe, Bill",48,M,FINC,52325
"Jones, Sally",35,F,ACCT,57000
...
"Ritchie, Denise",25,F,MKRT,61750

1.2 Example Program

# load Gerbing’s lessR package, turn off scientific notation, set page width
library(lessR)
options(scipen=30, width=100)

# read the .csv formatted data from the web into a data table called mydata
# alternative: rad() to browse for a .csv file stored on local computer system
rad("http://web.pdx.edu/~gerbing/data/employees.csv")

# summarize all the variables in the data table, as well as Salary by Gender
summary(mydata); sd(Age); sd(Salary)
by(Salary, Gender, summary)

# bar chart of counts of values for Dept
barplot(table(Dept), col="plum")

# color histogram of Salary with custom bins
color.hist(Salary, breaks=seq(40000,90000,5000))

# normal and general densities for Age imposed over histogram
color.density(Age)

# extensive t-test analysis with graphics
smd.t.test(Salary ~ Gender)

# follow-up power analysis for previous t-test
powercurve.t.test()

# extensive 1 and 2-predictor regression analyses with graphics
reg(Salary ~ Age)
reg(Salary ~ Age + Gender)


2 Accessing R

2.1 Default Version of R

R is available to anyone with an Internet connection and a computer that runs the Windows, Mac or Linux/Unix OS. Get the latest version of R for free at the following location on the web.

http://cran.r-project.org/

Windows: Click the Windows link near the top of the page. On the new page, click base. A new page appears again. Click the first link at the top of the page, which begins with Download R followed by the current version number.

Mac: Click the MacOS X link near the top of the page. On the new page, click the first file to download, under the heading of Files:, which lists the version number followed by (latest version).

2.2 Extend R

R organizes its many functions into groups called packages. The initial installed version of R includes six different packages, including the stats package for statistical analysis functions, the graphics package and the base package for various utilities. Many more functions are available in contributed packages, each of which must first be downloaded and then installed onto the computer on which R is run.

The following R functions provide access to the functions provided within a contributed package, such as Gerbing’s lessR. The functions in this package simplify interaction with R, where “Less is More”. That is, the goal is less R code that yields more results which foster best practices.

> install.packages("lessR")

Download the specified package from the R system onto a specific computer. Do this only once.

> library(lessR)

For each new R session, load the package and its constituent functions into active memory to allow access to the functions and associated documentation [1].

> help(package=lessR)

Document the purpose of the package and a list of its constituent functions.

> update.packages() or > update.packages(oldPkgs="lessR")

Download revisions to all or specified installed packages for which revisions are available.
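A script that may run on a computer where a contributed package has not yet been installed can check for the package before loading it. The following is a minimal sketch that assumes only base R; the helper name load_if_installed is hypothetical and is not part of lessR.

```r
# Hypothetical helper: attach a package only if it is already installed.
load_if_installed <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # character.only = TRUE lets library() accept the name stored in pkg
    library(pkg, character.only = TRUE)
    TRUE
  } else {
    message("Package '", pkg, "' is not installed; ",
            "run install.packages(\"", pkg, "\") once, then try again.")
    FALSE
  }
}

# stats ships with every R installation, so this call is expected to succeed
ok <- load_if_installed("stats")
```

Replacing "stats" with "lessR" gives the same check for the lessR package.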

[1] As an option, can store this library function call, or any other set of R function calls, in a text file called .Rprofile that runs the specified instructions each time an R session begins. For Windows, place the file in the top level of the Documents folder. For Mac and Linux, place the file in the top level of the user’s home folder. However, on the Mac and Linux the Unix convention is followed in which files with names that start with a period are hidden in the file directory, so that a hidden file must be accessed to be re-edited. Google mac "hidden file" for more information, or, to re-edit, just re-compose in preferably a text editor and re-save once again as a text file.


There are many contributed packages written for R. Anyone can write a package and make it available for R users world wide on the CRAN servers, the Comprehensive R Archive Network. A list of all available contributed packages on the CRAN servers is found at the following web address.

http://cran.r-project.org/web/packages/

A description of each function in a package that has been loaded into active memory with the library function, whether from the default installation or from a contributed package, is obtained from the help function (see below). Also, the full reference manual for each package is available from the list of R packages by clicking on the name of the corresponding package. Here is the direct link to the lessR manual and other material.

http://cran.r-project.org/web/packages/lessR/index.html

Many R packages are grouped into what are called task views, so that users are more easily able to locate packages that provide functions directed towards certain general areas of analysis, such as for the social sciences, multivariate statistics, econometrics, and psychometrics. Links to these task views are found here.

http://cran.r-project.org/web/views/

2.3 Getting Help in R

> help(mean)

Help can be obtained for any R function with the help function and the function name.

> ?mean

Can also abbreviate the help function with a ?.

> help.me()

Although the help function provides much information regarding the specified function, the catch-22 is that first the name of the function must be known. This lessR function lists functions by the type of analysis performed.

> help(package=lessR)

To see all the functions available within a package, use the package option for the help function.

> help(package=stats)

> help(package=graphics)

Or, view the names and a brief description of all the provided functions in the automatically available stats package and graphics package.
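When the exact name of a function is not known, base R also offers two search tools that work without lessR. The sketch below uses apropos to match object names against a pattern and help.search to search the installed help pages by topic; the search terms are only illustrations.

```r
# apropos() returns the names of visible objects that match a pattern.
mean_like <- apropos("^mean")
print(mean_like)

# help.search() looks through the installed help pages for a topic.
# It is guarded here because its output is meant to be read interactively.
if (interactive()) {
  print(help.search("standard deviation", package = "stats"))
}
```

At the console, ??"standard deviation" is the abbreviated form of the help.search call.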

2.4 Getting Help in Other Places

Some useful web references:


Robert Kabacoff’s Quick R web site: http://www.statmethods.net/
Theresa Scott’s R pdf files: http://biostat.mc.vanderbilt.edu/wiki/Main/TheresaScott
much graphics info: http://faculty.smu.edu/ngh/stat6304/class_sgraph10.pdf

Also, CRAN maintains a listing of free documentation and papers explaining the use of R.

http://cran.r-project.org/doc/contrib/

Some useful papers in this collection include “Practical Regression and Anova using R”, by Julian J. Faraway under the link Faraway-PRA.pdf. Also of interest is “simpleR – Using R for Introductory Statistics”, by John Verzani under the link of Verzani-SimpleR.pdf. Other useful papers also likely exist in this collection.

The official R manuals are also available.

http://cran.r-project.org/doc/manuals/

Two books I ordered for the PSU library:

Introductory Statistics with R by Peter Dalgaard (2008)
Data Manipulation with R by Phil Spector (2008)

These references provide many technical details and examples regarding the use of R. The emphasis here is on the more general principles that facilitate its use, such as using R in conjunction with Excel and how to most efficiently use R with a text editor as well as a general overview.

3 Using R

3.1 R input and output

To use R is to call a function. Enter a call to a function in R after the provided command prompt, >. If a function call is extended to a new line with an Enter or Return before completion, the R prompt changes to +.

Some functions generate text at the console, the same window in which R instructions are entered. Other functions generate graphics in a graphics window, and some functions generate output to both windows. Output in any window can be saved to a file at any time with the usual File → Save menu sequence, or, if specified, written directly to a file. The information in a graphics window can be saved as a pdf file or as a bit-mapped file in one of several different formats.

3.2 Comments

R comments begin with a #. An entire line can serve as a comment with a # in the first column. Or, part of a line can serve as a comment. Whatever follows the # entered somewhere on the line is the comment.

3.3 Worksheet Apps and R

Often the majority of work that underlies data analysis is the organizing and cleaning of the data for subsequent statistical analysis. As opposed to most classroom data sets, real life data analysis is often messy, with the initial data file far from ready for analysis. Viewing the data, manipulating the data, culling unreadable or nonsense data values, removing or correcting mis-formatted data values, and other such tasks are readily accomplished with a worksheet application such as MS Excel, OpenOffice Calc, or Apple’s Numbers.

Unlike the more intuitive worksheet application, data manipulation in R requires programming, sometimes rather esoteric programming. Fortunately, a worksheet application and R are not mutually exclusive tools for accomplishing data analysis, but instead complementary tools. One plausible strategy is to leverage the data manipulation abilities of Excel etc. with the vastly superior statistical capabilities of R.

The key is that any data file that a worksheet can read can be easily transferred to R. First remove all formatting from the data table stored in the worksheet, such as $ signs and commas in numbers, by converting all data values to the General format. Then do a Save As a .csv file. Read the resulting .csv file into R with the lessR rad function. R can read files of virtually any type, but .csv files are a universal standard, and rad reduces the task to a single function call.

Data can also easily be moved from R back to a worksheet with the lessR out function. This function writes the current data table in R as a .csv file. Any .csv file can be read by Excel, Calc or virtually any other application that can read data.
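The same round trip between a worksheet and R can be sketched with the base R functions write.csv and read.csv, which are available even when lessR is not loaded. This is only an illustration of the idea, not the implementation of rad or out. The three rows are the ones shown in the Example Data listing, and the temporary file stands in for a file saved from a worksheet.

```r
# The three rows shown in the Example Data listing
employees <- data.frame(
  Name   = c("Doe, Bill", "Jones, Sally", "Ritchie, Denise"),
  Age    = c(48, 35, 25),
  Gender = c("M", "F", "F"),
  Dept   = c("FINC", "ACCT", "MKRT"),
  Salary = c(52325, 57000, 61750),
  stringsAsFactors = FALSE
)

# Write the data frame as a .csv file that a worksheet application can open.
# A temporary file is used here in place of a real worksheet export.
csv_file <- tempfile(fileext = ".csv")
write.csv(employees, csv_file, row.names = FALSE)

# Read the .csv file back into a data frame called mydata
mydata <- read.csv(csv_file, stringsAsFactors = FALSE)
print(mydata)
```

Names that contain a comma, such as "Doe, Bill", survive the trip because write.csv encloses character values in quotes.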

3.4 Variables in a data frame

The generic variable in each function reference described in the following table is Y, and also X when another variable is referenced. Replace each of these generic references with the actual variable name, either a standalone variable or data vector, or, usually, a variable from the relevant R data table, what R calls a data frame. In a data frame the data values are arranged in columns such that each column begins with the name of the variable, which is then followed by the corresponding data values.
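As a small illustration of a variable inside a data frame, the sketch below builds a data frame from the Age and Salary values in the Example Data listing and then refers to one column in two standard base R ways. The object name mean_salary is introduced only for this example.

```r
# Two columns, each named by its variable and followed by its data values
mydata <- data.frame(Age = c(48, 35, 25), Salary = c(52325, 57000, 61750))

# The variable names in the data frame
names(mydata)

# Refer to a single variable with the $ operator
mydata$Salary

# with() evaluates an expression using the variables of the data frame
mean_salary <- with(mydata, mean(Salary))
print(mean_salary)
```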

3.5 Read the data into a data frame

This data frame can be named something such as mydata, as automatically provided by the lessR rad function for reading the data in .csv format. If the data file to be read can be accessed by the local file system, then the file can be located by browsing the file system with a call to rad() with no arguments.

rad()


Or, put the path name or full web address of the .csv data file within quotes inside the parentheses.

rad("http://web.pdx.edu/~gerbing/data/employees.csv")

The contents of any R object, such as a data frame, can be listed at the console simply by entering the name of the object.

> mydata

The contents of any R object in terms of its structure are revealed by the str function. When applied to a data frame, the output consists of one line per variable, which specifies the type of variable and some data values. To make sure that R displays enough digits for numerical values, explicitly specify the number of digits with the digits.d option.

> str(mydata, digits.d=15)

Also, there is a function called fix for making changes in a data frame within the R system. Invoke fix with the name of the data frame, such as mydata.

> fix(mydata)

3.6 Assignment operator

The R assignment operator is <-, which assigns everything on its right side to whatever is on its left side. This example assigns the value of 5 to the object named a.

> a <- 5

R supports many potential types of objects, including variables and data frames. The data frame is the object that contains a standard data table, the form in which data are typically presented for standard data analysis.

3.7 Combine function

R combines a set of values into a list or vector with the c function, such as c(2,4,6,8). Whenever a function requires the specification of a list of values, the list must always be expressed with the c function. An example is the xlim option that specifies the lower and upper limits of the x-axis in a plot. To specify limits of 0 and 100, enter the xlim=c(0,100) option into the relevant function call, such as the following for the standard R histogram function.

> hist(Y, xlim=c(0,100))

Or, apply the same syntax to the lessR color.hist function, which provides default colors and other advantages.

> color.hist(Y, xlim=c(0,100))


The c function is quite general. Here assign a list of the names of four people to the object mylist.

> mylist <- c("Mary", "Allison", "Eric", "Lance")

The combine function also facilitates R’s capability for vector arithmetic. Below are three lines of R input, followed by an output line. The [1] at the beginning of the line of output indicates that the first element present on that output line is the first element of the entire data display, for which there are only three values.

> x <- c(2,4,6)
> y <- x + 10
> y
[1] 12 14 16
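Vector arithmetic applies an operation to every element at once, so no explicit loop is needed. The short sketch below extends the example with a few more element-wise operations; the object names doubled, total and avg are chosen only for this illustration.

```r
x <- c(2, 4, 6)

# A single number is applied to every element of the vector
doubled <- x * 2

# Two vectors of the same length are combined element by element
total <- x + c(10, 20, 30)

# Functions such as sum() and length() reduce a vector to a single value
avg <- sum(x) / length(x)

print(doubled)
print(total)
print(avg)
```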

3.8 Editing an R statement on the command line

After an R instruction is entered and run, the instruction can be re-run and/or edited without retyping. After running the command by entering it and then pushing the Enter or Return key, push the ↑ key, which causes the statement to reappear as if re-entered from the keyboard. Then, if desired, use the ← and → keys, or the mouse, to move to a specific part of the R statement. Then edit the statement before pressing Enter or Return.

3.9 Options and decreased reliance on scientific notation

R tends to report numerical results with many decimal digits in terms of scientific notation. For example, in hypothesis testing a critical result is the p-value, a probability value that is compared against a criterion such as α = 0.05. In situations in which the p-value is close to a value of 0.0, R might report a value such as p-value < 2.2e-16.

To lessen R’s reliance on scientific notation, invoke the scipen option for the options function, which can override default options for scientific notation and many other values. Set scipen to a specific value: the larger the value, the less the tendency to rely upon scientific notation. A value of 50 pretty much directs R to abandon scientific notation entirely, unless there were more than 50 significant digits.

> options(scipen=50)

R now reports the previously reported p-value as: p-value < 0.00000000000000022.

There is little sense in this situation in trying to retain 16 significant digits, regardless of whether scientific notation or regular notation is used, but that is what R insists upon. A value of 0.0000 is all the user needs to know to understand that, indeed, the p-value is much less than α. When using regular notation, R still does not abandon these unnecessary digits, but the scientific notation is gone.

A list of all the current values of the options can also be obtained.

> options()
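The effect of scipen can be seen directly by formatting the same small number under two settings. The sketch below is a minimal illustration that assumes R's default number of printed digits; it saves the previous setting and restores it at the end, and the object names sci, fixed and rounded are chosen only for this example.

```r
# Set the default penalty of 0 and remember the previous settings
old <- options(scipen = 0)

p <- 2.2e-16
sci <- format(p)          # scientific notation under the default penalty

options(scipen = 50)
fixed <- format(p)        # regular notation under the larger penalty

# Rounding is one way to drop the digits that are of no practical use
rounded <- round(p, 4)

# Restore the settings that were in place before this example
options(old)

print(sci)
print(fixed)
print(rounded)
```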


3.10 Case sensitivity

R is case sensitive. To correctly specify a name, it must not only be spelled correctly, the pattern of lower and upper case letters must also match. For example, x does not refer to an object named X.

3.11 Subsetting

A data table consists of rows of data and columns of variables. Subsetting can occur for either or both. Subsetting rows results in only some rows being retained, and subsetting columns results in only some variables being retained.

As an example, consider a data matrix with numeric variables Y, X1, X2, X3, X4, X5 and one categorical variable, Gender.

To refer only to data for males, subset by rows, selecting just the rows of data in which the value of Gender is Male. The double equal sign, ==, indicates the logical test for equality.

> mynewdata <- subset(mydata, Gender=="Male")

Or, to refer to data just for variables Y, X1, X2, X3 and X5, then subset by columns, indicated by the select option. The notation X1:X3 refers to columns starting with X1 and ending with X3, so column X2 is also included in this specification.

> mynewdata <- subset(mydata, select=c(Y,X1:X3,X5))

Equivalently, just drop variables X4 and Gender from the reference. Indicate dropping one or more variables, instead of including the variables, with the - sign after the select=.

> mynewdata <- subset(mydata, select=-c(X4,Gender))

In addition, simultaneously subset rows and columns of the data matrix, or data frame in R terminology. When there is no explicit option name provided for the row subset, always list the condition that selects the rows to be retained before the select option.

> mynewdata <- subset(mydata, Gender=="Male", select=c(Y,X1:X3,X5))

Also, multiple conditions can be invoked for which to subset rows. Now refer to only those rows for Males and in which also the value of Y is less than 100, and also refer only to the specified columns.

> mynewdata <- subset(mydata, Gender=="Male" & Y<100, select=c(Y,X1:X3,X5))

R also provides the more general square bracket operator for reference to only part of a data table or even a single value of a variable, such as Y[2] for the second value of the data vector Y. Here the use of the square bracket operator produces subsets.

> mynewdata <- mydata[Gender=="Male", ]


This instruction yields the same result as the first subset example above. Only data for which Gender is equal to Male are included in the new data frame, here called mynewdata.
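The two styles of subsetting can be compared on a small data frame. The sketch below uses simulated values, generated only for this illustration, with the variable names from the example above. The bracket form names the data frame explicitly with mydata$Gender, an assumption made here so that the code does not depend on the variables being attached.

```r
# Simulated data, for illustration only
set.seed(123)
mydata <- data.frame(
  Y  = rnorm(6, mean = 100, sd = 15),
  X1 = 1:6, X2 = 7:12, X3 = 13:18, X4 = 19:24, X5 = 25:30,
  Gender = c("Male", "Female", "Male", "Female", "Male", "Female")
)

# Rows for males and selected columns, with the subset function
by_subset <- subset(mydata, Gender == "Male", select = c(Y, X1:X3, X5))

# The same rows and columns with the square bracket operator:
# the row condition goes before the comma, the columns after it
by_bracket <- mydata[mydata$Gender == "Male", c("Y", "X1", "X2", "X3", "X5")]

print(by_subset)
print(isTRUE(all.equal(by_subset, by_bracket)))
```

One difference worth knowing: when the row condition evaluates to NA for some row, subset drops that row, whereas the square bracket operator returns a row of missing values in its place.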

4 Missing Values

4.1 Assign Missing Values

R represents a missing value with a NA, an abbreviation for “not available”. If a value is actually missing, but there is some other value in its place, then assign the value with NA. For example, if Y is a vector of numbers, and the second value is missing, then assign as follows.

Y[2] <- NA

Change all occurrences of 9 throughout the entire data set called mydata to missing values.

mydata[mydata=="9"] <- NA

Set to missing all values of Y outside the range from -4 to 4.

Y[Y < -4 | Y > 4] <- NA

4.2 Ignore Missing Values

Just one missing data value NA for a variable typically results in the entire computation of a statistic for that variable returning NA. If, instead, the na.rm = TRUE option is invoked in the computation of the statistic, then missing values are ignored in the computation. This example is for the computation of the standard deviation of Y.

sd(Y, na.rm = TRUE)

Similarly, the function na.omit removes all the rows of a data frame that contain any missing values.

cleaned.data <- na.omit(mydata)
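The pieces of this section can be followed from start to finish on a few values. The numbers in the sketch below are made up for illustration, and the names mydata2 and n_missing are hypothetical.

```r
# Made-up values for illustration; 6.0 lies outside the range from -4 to 4
Y <- c(1.5, 3.0, -0.5, 6.0, 2.0)

# Replace the out-of-range value with a missing value
Y[Y < -4 | Y > 4] <- NA
n_missing <- sum(is.na(Y))

# A single NA makes the statistic NA unless na.rm = TRUE is specified
mean_with_na    <- mean(Y)
mean_without_na <- mean(Y, na.rm = TRUE)

# na.omit() drops every row of a data frame that contains any NA
mydata2 <- data.frame(Y = Y, X = c(10, NA, 30, 40, 50))
cleaned.data <- na.omit(mydata2)

print(n_missing)
print(mean_with_na)
print(mean_without_na)
print(cleaned.data)
```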

5 Graphics

R presents many graphics options that can yield beautiful graphs. The system works by presenting functions at two different levels, a lower more atomic level and a higher, more user-friendly level. Using standard R functions such as plot, a scatterplot of variables X and Y can be easily produced by just invoking the following.

> plot(X,Y)


The plot function works by sequentially calling lower-level functions such as points and axis. Most users never need bother with these lower-level calls, but then, the ultimate customization of the graph is lost if these lower-level functions are not directly called.

The lessR functions introduced here provide yet a higher level set of function calls. For example, the lessR function color.hist relies upon the standard R hist function for the histogram calculations, but also upon the lower-level graphics functions as well, providing the user direct access to these functions by default settings and specified options for background color, etc.

Many graphics functions and options are available throughout the R system, including lessR. Some of these, such as labels for the graph, are presented next. A full range of available options can be explored via the help files for the functions plot, par, axis and points.
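To make the two levels concrete, the sketch below builds a scatterplot piece by piece from lower-level functions of the standard graphics package: an empty plot region, then the points, the two axes, the surrounding box and the title. The data are simulated for illustration, and the graph is written to a temporary pdf file so that the example also runs without a graphics window.

```r
# Simulated data, for illustration only
set.seed(1)
X <- rnorm(50, mean = 50, sd = 10)
Y <- X + rnorm(50, mean = 0, sd = 5)

plot_file <- tempfile(fileext = ".pdf")
pdf(file = plot_file)

# Set up the coordinate system only: no points and no axes yet
plot(X, Y, type = "n", axes = FALSE, xlab = "X", ylab = "Y")

# Add each component with its own lower-level call
points(X, Y, pch = 19, col = "steelblue")
axis(side = 1)   # horizontal axis
axis(side = 2)   # vertical axis
box()            # frame around the plot region
title(main = "Scatterplot built from lower-level calls")

invisible(dev.off())
```

Each component is now under separate control, for example a different color for the points than for the box, which is the customization that a single plot(X,Y) call handles with defaults.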

5.1 Labels

When generating a graph, such as a histogram, R often provides reasonable default labels. To provide custom labels, use the following options for virtually all R graphics commands.

main="my main title"

xlab="my label for the x or horizontal axis"

ylab="my label for the y or vertical axis"

5.2 Colors

Colors can be added to most graphs with the col option. For example, a histogram with bars coloredwith lightsteelblue is generated from the stand-alone variable Y, which consists of 100 simulateddata values drawn from a normal distribution with mean of 50 and a standard deviation of 10.

> Y <- rnorm(100, mean=50, sd=10)> hist(Y, main="Histogram for Y", xlab="Y (in thousands)", col="lightsteelblue")

Or, just use the lessR color.hist function for complete control over the color of the graph.

To see a list of all the named colors available for graphical displays, enter colors(). Customized colors can be added with the rgb function. To see a list of all the color names, a sample of each color, and the corresponding rgb values, use the color.show function from the lessR package.

> color.show()
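The named colors and the rgb function can also be explored with base R alone. The sketch below is only an illustration; the object name my.blue and the chosen component values are assumptions.

```r
# How many named colors are available, and the first few names
length(colors())
head(colors())

# A custom color from red, green and blue components on a 0 to 255 scale
my.blue <- rgb(70, 130, 180, maxColorValue=255)
my.blue    # the color as a hexadecimal string

# The rgb components of a named color
col2rgb("lightsteelblue")

# Use the custom color for the bars of a histogram
set.seed(10)
Y <- rnorm(100, mean=50, sd=10)
hist(Y, col=my.blue, border="white")
```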

5.3 Write Graphic Directly to a pdf File

Alternatively, for graphics output, save each image as a graphics file, such as in pdf format with the standard File → Save menu sequence. Or, have R output the graphic directly with the pdf function.

> pdf(file="whatever.pdf")
...graphics function call(s)
> dev.off()

Invoking the pdf function should be matched with the closing dev.off function. When an external file is opened with a function such as pdf, R writes to this file for each successive graphics function call that is provided. The dev.off function informs R that there are no more graphics instructions for this file, so that R completes the graph and closes the file. Until the file is closed, it may not be viewable because its contents have not yet been completely written. The file is also closed automatically when the R session ends if it is still open.

The specific width and height of the graphics file can also be specified, such as the following.

> pdf(file="whatever.pdf", width=6, height=4)

The default unit of length is inches. The default width and height are 7 inches.

The default background of the created file is transparent. When viewing in the typical pdf viewer, such as Adobe Reader, the background will display as white. Fortunately, this white is only a visual effect, so that when the file is inserted into a slide show with a colored background, the background will continue to display “behind” the inserted file.

To obtain a true white background, add the bg="white" option to the pdf statement. Of course, any other color for the background can also be specified. Similarly, change the default foreground color from black to whatever is desired, such as with fg="darkblue".
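Putting the pieces of this section together, a complete sequence might look like the following sketch. The file name and the use of a temporary directory are assumptions chosen so that the example does not write into the working directory.

```r
# Hypothetical output location in the session's temporary directory
out <- file.path(tempdir(), "histogram.pdf")

set.seed(3)
Y <- rnorm(100, mean=50, sd=10)

# Open the pdf device: 6 by 4 inches, white background, dark blue foreground
pdf(file=out, width=6, height=4, bg="white", fg="darkblue")

# Every graphics call now writes to the file instead of the screen
hist(Y, main="Histogram for Y", col="lightsteelblue")

# Close the device so that the file is completed
dev.off()

file.exists(out)
```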

5.4 Write Graphic Directly to a Bitmap Format

Bitmapped graphic formats are also available as alternatives to pdf, including those produced by the jpeg and png functions. These other formats are invoked in the same way as the pdf function. Just substitute the name of the different format. Default units for these bitmapped formats are pixels and the default background is white; the jpeg format does not support transparency. Bitmapped images, unlike pdf images, cannot be resized without distortion.
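A sketch of the same sequence for a png file follows, with the width and height now given in pixels. The file name is hypothetical, and the call is wrapped in a check of capabilities("png") because not every R installation is built with png support.

```r
# Hypothetical output location in the session's temporary directory
out <- file.path(tempdir(), "histogram.png")

set.seed(11)
Y <- rnorm(100, mean=50, sd=10)

# Only attempt the png device if this installation of R supports it
if (capabilities("png")) {
  png(filename=out, width=600, height=400)   # size in pixels
  hist(Y, main="Histogram for Y", col="lightsteelblue")
  dev.off()
}
```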

5.5 Location of the Graphic File

A graphic file written to the file system is created in the R working directory. To identify this directory, invoke the getwd function, with no arguments.

> getwd()

On Windows, the default working directory is usually the Documents folder. On the Mac, the default working directory is root, that is, the top directory level of the computer.
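The working directory can also be changed with the standard setwd function, so that graphics files land in a chosen folder. The following sketch uses the session's temporary directory as a stand-in for such a folder; the file name is made up for the example.

```r
# Remember the current working directory
old.dir <- getwd()

# Switch to another directory (here, the temporary directory)
setwd(tempdir())

# A file name given without a path is created in the working directory
pdf(file="example.pdf")
plot(1:10)
dev.off()
created <- file.exists(file.path(getwd(), "example.pdf"))

# Restore the original working directory
setwd(old.dir)
```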

5.6 Insert PDF Image into MS Word

Ultimately, the results of the statistical analysis are gathered together and interpreted in a report. The medium for this report is usually a word processing document, such as with MS Word, which, however, does not recognize pdf files as image files. Here is how to insert a pdf image stored in its own file directly into an MS Word file.

MS Word on the Mac: Insert → Picture → From File...

MS Word on Windows: Insert → Object → From File... and then Create from File

6 lessR Features

The implementation of the desired conceptual principles for statistical analysis software is via the functions in the lessR contributed package. Examples of the implementation of each principle follow.

6.1 Fewer functions, more analysis

Comparing two distributions is much more than obtaining a p-value from the test of the null hypothesis of equal population means. A complete analysis includes at least the following analyses.

1. Descriptive statistics of Y for each group

2. Evaluate assumption: Normality of Y for a group when n < 30

3. Evaluate assumption: Homogeneity of variance

4. Standardized mean difference, Cohen’s d

5. Confidence interval of the mean difference

6. Confidence interval of the standardized mean difference

7. Comparison of the distributions overall by histogram or estimated density distributions of Y for each group

The lessR smd.t.test function supplants the standard R t.test function by providing all of these analyses by default. Both smd.t.test and t.test share the same syntax, though unlike t.test, the smd.t.test function invokes the traditional assumption of homogeneity of variance.
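To convey how many separate steps the listed analyses require with standard R functions alone, here is a rough sketch for simulated data. It is not the lessR implementation; the data, the group labels and the object names are assumptions, and item 6, the confidence interval of the standardized mean difference, is omitted because it requires the noncentral t distribution.

```r
# Simulated data for two groups, for illustration only
set.seed(3)
X <- factor(rep(c("A", "B"), each=20))
Y <- c(rnorm(20, mean=50, sd=10), rnorm(20, mean=56, sd=10))
groups <- split(Y, X)

# 1. Descriptive statistics of Y for each group
n <- sapply(groups, length)
m <- sapply(groups, mean)
s <- sapply(groups, sd)

# 2. Normality of Y within each group (Shapiro-Wilk p-values)
sapply(groups, function(y) shapiro.test(y)$p.value)

# 3. Homogeneity of variance
var.test(Y ~ X)

# 4. Cohen's d, based on the pooled within-group standard deviation
sw <- sqrt(((n[["A"]] - 1) * s[["A"]]^2 + (n[["B"]] - 1) * s[["B"]]^2) /
           (n[["A"]] + n[["B"]] - 2))
d <- (m[["A"]] - m[["B"]]) / sw

# 5. t-test and confidence interval of the mean difference,
#    assuming homogeneity of variance
tt <- t.test(Y ~ X, var.equal=TRUE)
tt$conf.int

# 7. Estimated densities of Y for each group on the same plot
dA <- density(groups$A)
dB <- density(groups$B)
plot(dA, xlim=range(dA$x, dB$x), ylim=range(dA$y, dB$y),
     main="Y by group", xlab="Y")
lines(dB, lty=2)
legend("topright", legend=c("A", "B"), lty=1:2)
```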

Similarly, for a regression analysis, besides the estimated model coefficients, their inferential analysis and fit statistics, also desired are an outlier analysis and the prediction interval for each row of data. The lessR reg function provides these additional analyses with one function call.

6.2 Foster best practice

For any regression analysis done in any context, it is difficult to understand why identifying the observations that have the largest influence on the model estimates would not be of interest. Certainly the computational demands on any modern computer to calculate the values of influence statistics such as the Studentized Residual or Cook’s Distance are so minimal as to be irrelevant for all but the largest data sets. R provides these influence statistics, but only after additional function calls, and then provides their values in an inconvenient format. The values of the influence statistics for each row of data are not listed in conjunction with the values of the predictor variables and response variable, and, further, the data are not sorted according to the influence statistics. And, a row of output is displayed for every row of data, even if there are 2000 rows of data and only the small number of observations with high influence statistics are of interest.

The lessR reg function by default provides the value of Cook’s Distance for each observation (row of data), with an option for the Studentized Residual. By default the data are sorted by Cook’s Distance, and the data values for each observation are juxtaposed with the influence statistics. By default, only the values for the 25 observations with the largest influence statistic are listed. Again, all of these defaults can be overridden, by specifying options such as res.sort="rstudent" or res.sort="off", pred.sort="off" or res.rows=10 to list only the 10 rows of data with the largest influence statistic.
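For comparison, a similar listing can be pieced together from standard R functions. The following is a sketch under assumed, simulated data, not the code used by reg; the data frame and column names are made up for the example.

```r
# Simulated data, for illustration only
set.seed(4)
mydata <- data.frame(X1=rnorm(40), X2=rnorm(40))
mydata$Y <- 10 + 2 * mydata$X1 - mydata$X2 + rnorm(40)

# Fit the model with the standard R lm function
fit <- lm(Y ~ X1 + X2, data=mydata)

# Juxtapose the data values with the influence statistics
infl <- cbind(mydata,
              rstudent=rstudent(fit),
              cooks=cooks.distance(fit))

# Sort by Cook's Distance, largest first, and list only the top 10 rows
infl <- infl[order(infl$cooks, decreasing=TRUE), ]
head(infl, n=10)
```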

One of the primary reasons for regression analysis is to generate a model to predict future values of the response variable. Unfortunately, the resulting forecasting or prediction error is larger than the usually provided standard error of estimate because it involves not only random variation about the regression line, but also sampling error, which is not accounted for by the standard error of estimate. Again, unless there is no interest in using a model to forecast, these prediction errors should be of interest, and again, are provided by reg by default. Further, to enhance the usefulness of the output, the data are sorted by the lower bound of the corresponding prediction interval. And, if there is only one predictor variable in the model, a color scatterplot is automatically generated with the fitted regression line and the confidence and prediction intervals.
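The distinction between the confidence interval and the wider prediction interval can be seen with the standard R predict function. This sketch uses simulated data with one predictor; the object names are assumptions, and the plot is only a plain approximation of the kind of graph described above.

```r
# Simulated data with one predictor, for illustration only
set.seed(5)
mydata <- data.frame(X=runif(30, min=0, max=10))
mydata$Y <- 3 + 2 * mydata$X + rnorm(30, sd=2)

fit <- lm(Y ~ X, data=mydata)

# Intervals for the fitted values and for individual predictions
conf.int <- predict(fit, newdata=mydata, interval="confidence")
pred.int <- predict(fit, newdata=mydata, interval="prediction")

# List the data with the prediction intervals, sorted by lower bound
out <- cbind(mydata, pred.int)
out <- out[order(out$lwr), ]
head(out)

# Scatterplot with the fitted line and both sets of intervals
o <- order(mydata$X)
plot(Y ~ X, data=mydata)
abline(fit)
matlines(mydata$X[o], conf.int[o, c("lwr", "upr")], lty=2, col="darkred")
matlines(mydata$X[o], pred.int[o, c("lwr", "upr")], lty=3, col="darkblue")
```

For every row, the prediction interval is wider than the confidence interval because it also includes the random variation of individual values about the regression line.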

Or, regarding the t-test, its underlying assumptions should be evaluated as part of the usual output. The standardized mean difference and its associated confidence interval should be routinely reported. And, a comparison of group means may perhaps be generally embedded in the larger framework of the comparison between distributions. Compare the distributions by plotting their respective histograms on the same plot, or, better yet, their respective estimated density plots, as is done by default with the lessR smd.t.test function.

6.3 Meaningful diagnostics

A histogram will not be constructed unless there is a bin for every data point. This error can occur when the user simply specifies start and end points for the bins that exclude some data values. More commonly, this error may occur even when the user specifies bin start and end points beyond the range of the data, but the specified bin width does not have the last bin ending at the specified endpoint. That is, this version of the error can occur when the bin width does not evenly divide the interval between the starting and ending points for the bins.

An example of the R input and output follows for 100 randomly simulated data values from a normal population with µ = 50 and σ = 10, analyzed with the standard R hist function.

> Y <- rnorm(100,50,10)
> range(Y)
[1] 24.37018 83.14326
> hist(Y,breaks=seq(22,85,4))
Error in hist.default(Y, breaks = seq(22, 85, 4)) :
  some ’x’ not counted; maybe ’breaks’ do not span range of ’x’

This generated R error typically raises a few questions of interest for the new R user, including the introductory student:

◦ What is hist.default?

◦ What is x?

◦ What does the word “span” mean in this context?

◦ What am I supposed to do now?

The lessR color.hist function traps the error before trying to generate the histogram from its own internal hist function call. Then, if the error is detected, color.hist generates a more meaningful diagnostic and stops.

> color.hist(Y,breaks=seq(22,85,4))
Range of the data: 24.37018 to 83.14326
Bin Cutpoints: 22 26 ... 78 82
Data values too large to fit in the bins: 83.14326
Each data value must be in a bin.
To fix this problem, extend the bin range above 82.
Error: Try again with an extended bin range.
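The general idea of trapping the problem before calling hist can be sketched in a few lines of standard R. The function check.breaks below is hypothetical, written only for this illustration, and is not the code inside color.hist; the three data values are likewise made up to span the range shown above.

```r
# Hypothetical helper: stop with a readable message if any data value
# falls outside the range of the bin cutpoints
check.breaks <- function(y, breaks) {
  too.small <- y[y < min(breaks)]
  too.large <- y[y > max(breaks)]
  if (length(too.small) > 0 || length(too.large) > 0) {
    stop("Each data value must be in a bin.\n",
         "Range of the data: ", min(y), " to ", max(y), "\n",
         "Range of the bins: ", min(breaks), " to ", max(breaks),
         call.=FALSE)
  }
  invisible(TRUE)
}

# Made-up values that span the range of the example above
Y <- c(24.37018, 50, 83.14326)

# The requested end point of 85 is never reached with a bin width of 4
breaks <- seq(22, 85, 4)
max(breaks)    # 82, not 85

# The check fails with a readable message instead of the hist error
result <- tryCatch(check.breaks(Y, breaks),
                   error=function(e) conditionMessage(e))
cat(result, "\n")

# Extending the bins to 86 covers all of the data
ok <- check.breaks(Y, seq(22, 86, 4))
h <- hist(Y, breaks=seq(22, 86, 4))
```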

6.4 Polymorphism

The lessR function color.plot generates a variety of plots by simply specifying the variable name in the function call.

> color.plot(Y)

Different default values are chosen for different circumstances of the specified plot and data values. The goal is to produce a desired graph by simply relying upon the default values, both for the color.plot function itself, as well as the base R functions called by color.plot, such as plot. Familiarity with the options permits complete control over the computed defaults, but this familiarity becomes optional if the default values are accepted.

When two variables are specified to plot, if the values of the first variable, x, are unsorted, a scatterplot is produced. Sorted values of the first of the two specified variables yield a function plot for a smooth plotted line or curve.

Specifying just one variable leads to a run chart, with the values on the horizontal axis automatically generated. The default is the Index variable, the ordinal position of each data value. Or, dates on the horizontal axis can be specified from the starting date given by x.start and the accompanying increment given by x.by. If the data values randomly vary about the mean, the default is to plot the mean as the center line of the graph, otherwise the default is to omit the center line. By default both points and the corresponding connected line segments are plotted. The size of the points is automatically reduced according to the number of points plotted. If the area below the plotted values is specified to be filled in with color, then by default each pair of adjacent points is connected with a line segment.
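The dispatch logic described in this section can be sketched with standard R functions. The function my.plot below is a hypothetical, much simplified stand-in for color.plot, intended only to show how the type of plot can be chosen from the data, here with the base R is.unsorted function.

```r
# Hypothetical, simplified dispatcher: choose the plot type from the data
my.plot <- function(x, y=NULL) {
  if (is.null(y)) {
    # One variable: run chart of the values against their ordinal position
    plot(seq_along(x), x, type="b", xlab="Index", ylab="")
    kind <- "run chart"
  } else if (is.unsorted(x)) {
    # Two variables, first one unsorted: scatterplot of points
    plot(x, y, type="p")
    kind <- "scatterplot"
  } else {
    # Two variables, first one sorted: smooth line or curve
    plot(x, y, type="l")
    kind <- "function plot"
  }
  invisible(kind)
}

set.seed(6)
k1 <- my.plot(rnorm(30))                     # one variable

x.unsorted <- c(3, 1, 2, 5, 4)
k2 <- my.plot(x.unsorted, 2 * x.unsorted)    # unsorted x

x.sorted <- seq(-3, 3, by=0.1)
k3 <- my.plot(x.sorted, dnorm(x.sorted))     # sorted x
```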

6.5 Color

The standard R function for a histogram, hist, directly allows the user to color the bars and bar borders, by overriding the default of transparent bars. Also, by employing lower-level options from the plot function (col.axis, col.lab and col.main), other colors on the plot can be specified. Providing gridlines and a background color for the histogram requires lower-level programming beyond just a call to the hist function.
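One way to carry out that lower-level programming with standard R functions is sketched below. It is an illustration under assumed colors and simulated data, and other approaches are possible.

```r
# Simulated data, for illustration only
set.seed(7)
Y <- rnorm(100, mean=50, sd=10)

# Compute the bins without drawing anything
h <- hist(Y, plot=FALSE)

# Draw the histogram once to set up the axes, titles and coordinates
plot(h, main="Histogram for Y", xlab="Y",
     col.main="darkblue", col.lab="gray30", col.axis="gray30")

# Paint the plot region with a background color, which covers the bars
usr <- par("usr")
rect(usr[1], usr[3], usr[2], usr[4], col="ghostwhite", border="black")

# Add gridlines on top of the background
grid(col="gray80", lty="solid")

# Redraw the bars on top of the background and the gridlines
plot(h, col="lightsteelblue", border="darkblue", add=TRUE)
```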

The lessR color.hist function provides, by default, colored bars, gridlines and a background color, and each can be respecified by its respective option, col for the bars, col.grid for the gridlines and col.bg for the background. The col.axis, col.lab and col.main options can also be employed to change the corresponding color characteristics.

The same general principles apply to the lessR functions color.plot and color.density. Complete default color schemes are provided, with each aspect overridden either by a parameter defined by the respective function or by more basic parameters from the R system overall.

7 Development Environments

R works just fine by entering lines of instructions, line by line, into the console in response to the R command prompt, especially when editing with the arrow keys and mouse.

However, a more efficient method for working with R is to enter the desired instructions into a text editor and then either copy and paste the resulting code into the R console, or use the source function to access the file of R instructions. That way the code can be edited at will and easily stored. Note that a text editor is usually a better choice than a word processor for working with code because a word processor changes things around to “help”. For example, a word processor will typically substitute curly quotes for straight quotes. That works fine for a letter to your favorite aunt, but R does not understand curly quotes.
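As a small illustration of the source function, the following sketch writes two lines of R code to a file and then runs that file. The file name and its contents are made up for the example; in practice the file would be written with a text editor.

```r
# Hypothetical script file, created here only so the example is complete
script <- file.path(tempdir(), "analysis.r")
writeLines(c(
  "Y <- c(48, 52, 50, 61, 39)",
  "Y.mean <- mean(Y)"
), con=script)

# Run all of the instructions stored in the file
source(script)

Y.mean    # 50
```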

Even better, have the code in the text editor run in R with a key press, and also have syntax coloring, to separate keywords and quoted strings from the rest of the text.

7.1 Macintosh

7.1.1 Recommended Free Text Editor: TextWrangler

Bare Bones Software provides a capable Macintosh text editor, TextWrangler, at no cost, useful whenever a text editor is needed, not just for developing R programs. TextWrangler has far more features than does the free text editor provided by Apple with every Macintosh, TextEdit, and very ably edits R programs, particularly with the additional enhancements discussed below. Find TextWrangler at

http://www.barebones.com/products/textwrangler/download.html

7.1.2 Coordinating R files with TextWrangler

In the R preferences panel, set the desired text editor to TextWrangler. Just click on the Editor icon under Source Editor. Then click the New Editor button and browse for the TextWrangler application. Now, from within R, selecting the File menu and then the New Document option opens a TextWrangler window.

When saving an R file, use the filetype of .r. Tell your Macintosh that files with this filetype should open up in TextWrangler when double-clicked. To do this, right click on an R file’s icon in the Finder and select Get Info. Now, under Open With: specify that all files with the filetype of .r should open in TextWrangler.app.

7.1.3 Automatic R Code Execution from within TextWrangler

Use AppleScript to set automatic code execution from within TextWrangler. When automatic code execution is working, just select some text and press the specified key combination, such as Command-R, and the selected text automatically executes within R. No need for copying code from TextWrangler and then pasting into R.

To make automatic code execution happen, copy the following AppleScript code into the AppleScript editor called Script Editor.app, found in the AppleScript folder in the Applications folder. Then do a Save As script in the Scripts folder, which is located as follows.

~/Library/Application Support/TextWrangler/Scripts

The ~ indicates your home directory in the Users folder. Start up TextWrangler and run the script from the script menu towards the end of the menu selections. Any selected code automatically runs in R.

--
-- This script can be used to make TextWrangler execute commands in R.
-- It executes the current selection.
-- If no text is selected it executes the current line.
-- Jean Thioulouse - Nov. 2008 - Jean.Thioulouse_at_univ-lyon1.fr
--
tell application "TextWrangler"
    set the_selection to (selection of front window as string)
    if (the_selection) is "" then
        set the_selection to line (get startLine of selection) of front window as string
    end if
end tell
tell application "R"
    cmd the_selection
end tell

The script will also run after you assign a special key combination. To do this, under the TextWrangler Window menu, choose the Palettes selection and then Scripts. At that point, assign the key combination. From here forward, select some R code in TextWrangler, press the chosen key combination, such as Command-R, and immediately R processes the code.

7.1.4 Syntax Coloring

Editing code is much more effective and easier when various types of words, such as language keywords, character strings and comments, are colored differently. TextWrangler will apply R syntax coloring when the appropriate file is added to the following directory.

~/Library/Application Support/TextWrangler/Language Modules/

This file, which I use, is available at http://www.smalltime.com/gene/R.plist

Apparently an updated file is available at

http://homepages.nyu.edu/~jmb736/code/R_language_module_for_BBEdit/R.plist

7.2 Linux

The standard, free editor for the GNOME desktop is gedit. A plugin for this editor called RGedit provides syntax coloring and also automatic code execution, that is, without needing to copy the code written in gedit to the R console. Even the R console output is directed to a new special window within the gedit environment. The plugin is available at the following web location.

http://sourceforge.net/projects/rgedit/

7.3 Windows

There is likely a comparable facility on Windows, but I have no personal experience with it. One possibility is File → New Script.

8 R Statistical Functions for Basic Analysis

The functions in the stats, graphics and base packages listed in the following table are part of the default R installation and are automatically loaded for each R session. Functions in other packages can be accessed only when the corresponding contributed package is downloaded and loaded for each R session, as described on Page 2.

The reference in the following table to “Sec, Slide” refers to where the functions are illustrated in more detail in my supporting materials, slide shows and corresponding videos.
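Several of the standard R functions listed in the table that follows can be tried directly on simulated data, with no contributed package loaded. The following sketch uses invented data and object names and covers only a handful of the listed functions.

```r
# Simulated data, for illustration only
set.seed(8)
Y <- rnorm(100, mean=50, sd=10)

# Basic descriptive statistics
c(n=length(Y), mean=mean(Y), median=median(Y), sd=sd(Y))
mean(Y, trim=0.10)    # trim is given as a proportion, here 0.10 from each end
range(Y)
quantile(Y)
IQR(Y)

# Standardized values, with a mean of 0 and a standard deviation of 1
z <- as.vector(scale(Y))

# Normal curve probability between 40 and 60 for mean 50 and sd 10
p.within <- pnorm(60, mean=50, sd=10) - pnorm(40, mean=50, sd=10)

# Normal curve quantile that cuts off the upper 2.5%
q.upper <- qnorm(0.975)

# Upper tail probability from the t-distribution
pt(2, df=20, lower.tail=FALSE)
```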

| Purpose | Sec, Slide | Description | Example | Package |
| --- | --- | --- | --- | --- |
| read | 1.1, 31 | browse to read .csv formatted data into an R data frame called mydata | > rad() | lessR |
| read | 1.1, 31 | read specified .csv data file | > rad("path name or URL") | lessR |
| count | 1.2, 6 | count each of the values of a variable | > table(Y) | base |
| bar plot | 1.2, 7,9 | barplot from a table of counts | > barplot(table(Y)) | graphics |
| pie chart | 1.2, 8 | pie chart from a table of counts | > pie(table(Y)) | graphics |
| histogram | 1.2, 15-17,21 | color histogram with frequency table | > color.hist(Y) | lessR |
| cumul hist | 1.2, 30,31 | use with color.hist cumul option to generate a cumulative histogram | > color.hist(Y,cumul="both") | lessR |
| density | | normal and/or general curve density | > color.density(Y) | lessR |
| box plot | | box plot, grouping var X is optional | > boxplot(Y ~ X) | graphics |
| min, max | 1.2, 20 | minimum and maximum value | > min(Y) > max(Y) | base |
| range | 1.2, 20 | min and max values of a variable | > range(Y) | base |
| sequence | 1.2, 21 | generate a sequence of values | > seq(start,stop,by) | base |
| pareto chart | 1.2, 36 | pareto chart | > pareto.chart(table(Y)) | qcc |
| n | 2.1, 9 | number of data values for variable Y | > length(Y) | base |
| n | | number of rows of data in a data frame, the sample size | > nrow(mydata) | base |
| n vars | | number of columns of data in a data frame, the number of variables | > ncol(mydata) | base |
| mean | 2.1, 3 | arithmetic average or mean | > mean(Y) | base |
| wt mean | 2.1, 4 | weighted mean, where wts is a vector of weights, one for each value of Y | > weighted.mean(Y,wts) | stats |
| trim mean | 2.1, 5 | trimmed mean, p% trim on each side | > mean(Y, trim=p) | |
| median | 2.1, 6 | median | > median(Y) | stats |
| stand dev | 2.1, 29 | standard deviation | > sd(Y) | stats |
| standardize | | standardize the data | > scale(Y) | |
| simulation | | simulate a sample of n data values from a normal distribution | > rnorm(n,mean,sd) | stats |
| simulation | | simulate a sample of n values from a binomial distribution with size trials each with p probability of success | > rbinom(n,size,p) | stats |
| probability | | normal curve probabilities; lower tail by default or set lower.tail=FALSE | > pnorm(quantile,mean,sd) | stats |
| probability | | normal curve probs between specified values of lo and hi with graph | > prob.norm(lo,hi) | lessR |
| probability | | t-distribution probabilities; lower tail by default or set lower.tail=FALSE | > pt(t,df) | stats |
| quantile | | normal curve quantiles | > qnorm(prob) | stats |
| quantile | | t-distribution quantiles | > qt(prob,df) | stats |
| quantile | | quantiles of data values | > quantile(Y) | stats |
| IQR | | interquartile range of data values | > IQR(Y) | stats |
| sample | | generate a random sample from a data vector, or replace Y with 1:n | > sample(Y, size=n) | base |

| Purpose | Sec, Slide | Description | Example | Package |
| --- | --- | --- | --- | --- |
| scatterplot | | scatterplot with colors, determined when variable X is not sorted | > color.plot(X, Y) | lessR |
| function | | plot of a function with colors, determined when variable X is sorted | > color.plot(X, Y) | lessR |
| run chart | | run chart, determined when only a single variable is referenced | > color.plot(Y) | lessR |
| control chrt | | various types of control charts | > qcc(...) | qcc |
| t-test | | independent groups t-test from data with density plots and the standardized mean difference graphed | > smd.t.test(Y ~ X) | lessR |
| t-test | | independent groups t-test from descriptive statistics | > stats.t.test(...) | lessR |
| power | | power analysis with color power curve for independent groups t-test | > powercurve.t.test(...) | lessR |
| ANOVA | | one-way analysis of variance | > oneway.test(Y ~ X) | stats |
| means plot | | plot marginal means from ANOVA | > plotmeans(Y ~ X) | gplots |
| post-hoc | | post-hoc comparison of marginal means from an ANOVA | > TukeyHSD(...) | stats |
| factorial | | balanced two-way ANOVA | > aov(Y ~ X1 * X2) | stats |
| regression | | extensive regression analysis with correlations, residuals and related indices, and prediction errors | > reg(Y ~ X1 + X2) | lessR |
| best subsets | | regression analysis of different sets of predictor variables to identify the best subset for this data set only | > leaps(...) | leaps |
| scatterplots | | scatterplot matrix of all pairs of two variables in a data table | > pairs(mydata) | |
| correlation | | correlations of all numeric variables in a data frame, such as mydata | > cor(mydata) | stats |
| cor hyp test | | correlation between two variables with the hypothesis test of no population correlation | > cor.test(X,Y) | stats |
| correlation | | complete correlation matrix with the corresponding matrices of sample sizes and p-values | > corr.test(mydata) | psych |
| covariance | | covariance between two variables | > cov(X,Y) | stats |
| ellipse | | scatterplot with confidence ellipse | > data.ellipse(X,Y) | car |
| pivot table | | 2D pivot or cross-tabulation table | > table(X,Y) | base |
| χ2 test | | chi-square test on pivot table | > chisq.test(table(X,Y)) | stats |
| tbl margins | | add the margins, the sums, to the pivot table | > addmargins(table(X,Y)) | stats |
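The last few entries of the table, along with the correlation functions, can be illustrated with a short sketch in standard R. The counts, group labels and variable names below are invented for the example.

```r
# Made-up categorical data: 50 cases in each of two groups
group <- rep(c("control", "treatment"), each=50)
outcome <- c(rep(c("no", "yes"), times=c(35, 15)),
             rep(c("no", "yes"), times=c(20, 30)))

# 2D pivot or cross-tabulation table
tbl <- table(group, outcome)
tbl

# The same table with the row, column and grand sums added
tbl.sums <- addmargins(tbl)
tbl.sums

# Chi-square test on the table
chi <- chisq.test(tbl)
chi$p.value

# Correlation between two simulated numeric variables, with its test
set.seed(9)
X <- rnorm(50)
Y <- X + rnorm(50)
r <- cor(X, Y)
ct <- cor.test(X, Y)
ct$p.value
```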
