
R News
The Newsletter of the R Project
Volume 3/1, June 2003

Editorial

by Friedrich Leisch

Welcome to the first issue of R News in 2003, which is also the first issue where I serve as Editor-in-Chief. This year has already seen a lot of exciting activity in the R project; one example is the Distributed Statistical Computing Workshop (DSC 2003) that took place here in Vienna. DSC 2003 is also partly responsible for the rather late release of this newsletter, as both members of the editorial board and authors of regular columns were involved in organizing the conference.

R 1.7.0 was released on 2003-04-16; see "Changes in R" for detailed release information. Two major new features of this release are support for name spaces in packages and stronger support for S4-style classes and methods. While the methods package by John Chambers has provided R with S4 classes for some time now, R 1.7.0 is the first version of R where the methods package is attached by default every time R is started. I know of several users who were rather anxious about the possible side effects of R going further in the direction of S4, but the transition seems to have gone very well (and we were not flooded with bug reports). Two articles by Luke Tierney and Doug Bates discuss name spaces and converting packages to S4, respectively.

In another article, Jun Yan and Tony Rossini explain how Microsoft Windows versions of R and R packages can be cross-compiled on Linux machines.

Changing my hat from R News editor to CRAN maintainer, I want to take this opportunity to announce Uwe Ligges as the new maintainer of the Windows packages on CRAN. Until now these have been built by Brian Ripley, who has spent a lot of time and effort on the task. Thanks a lot, Brian!

Also in this issue, Greg Warnes introduces the genetics package, Jürgen Groß discusses the computation of variance inflation factors, Thomas Lumley shows how to analyze survey data in R, and Carson, Murison and Mason use parallel computing on a Beowulf cluster to speed up gene-shaving.

We also have a new column with book reviews. 2002 saw the publication of several books explicitly written for R (or R and S-Plus) users. We hope to make book reviews a regular column of this newsletter. If you know of a book that should be reviewed, please tell the publisher to send a free review copy to a member of the editorial board (who will forward it to a referee).

Last, but not least, we feature a recreational page in this newsletter. Barry Rowlingson has written a crossword on R, and even offers a prize for the correct solution. In summary: we hope you will enjoy reading R News 3/1.

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org

Contents of this issue:

Editorial
Name Space Management for R
Converting Packages to S4
The genetics Package
Variance Inflation Factors
Building Microsoft Windows Versions of R and R packages under Intel Linux
Analysing Survey Data in R
Computational Gains Using RPVM on a Beowulf Cluster
R Help Desk
Book Reviews
Changes in R 1.7.0
Changes on CRAN
Crossword
Recent Events


Name Space Management for R

by Luke Tierney

Introduction

In recent years R has become a major platform for the development of new data analysis software. As the complexity of software developed in R and the number of available packages that might be used in conjunction increase, the potential for conflicts among definitions increases as well. The name space mechanism added in R 1.7.0 provides a framework for addressing this issue.

Name spaces allow package authors to control which definitions provided by a package are visible to a package user and which ones are private and only available for internal use. By default, definitions are private; a definition is made public by exporting the name of the defined variable. This allows package authors to create their main functions as compositions of smaller utility functions that can be independently developed and tested. Only the main functions are exported; the utility functions used in the implementation remain private. An incremental development strategy like this is often recommended for high level languages like R; a guideline often used is that a function that is too large to fit into a single editor window can probably be broken down into smaller units. This strategy has been hard to follow in developing R packages, since it would have led to packages containing many utility functions that complicate the user's view of the main functionality of a package and that may conflict with definitions in other packages.

Name spaces also give the package author explicit control over the definitions of global variables used in package code. For example, a package that defines a function mydnorm as

mydnorm <- function(z)
    1/sqrt(2 * pi) * exp(- z^2 / 2)

most likely intends exp, sqrt, and pi to refer to the definitions provided by the base package. However, standard packages define their functions in the global environment, shown in Figure 1.

Figure 1: The dynamic global environment: a search path containing .GlobalEnv, package:ctest, ..., and package:base.

The global environment is dynamic in the sense that top level assignments and calls to library and attach can insert shadowing definitions ahead of the definitions in base. If the assignment pi <- 3 has been made at top level, or in an attached package, then the value returned by mydnorm(0) is not likely to be the value the author intended. Name spaces ensure that references to global variables in the base package cannot be shadowed by definitions in the dynamic global environment.
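A small interactive illustration of the problem (not part of the original example): with mydnorm defined in the global environment, a top level assignment to pi silently changes its result.

> mydnorm <- function(z) 1/sqrt(2 * pi) * exp(-z^2/2)
> mydnorm(0)      # uses pi from base, agrees with dnorm(0)
[1] 0.3989423
> pi <- 3         # shadows the definition in base
> mydnorm(0)      # silently wrong
[1] 0.4082483
> rm(pi)          # remove the shadowing definition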

Some packages also build on functionality provided by other packages. For example, the package modreg uses the function as.stepfun from the stepfun package. The traditional approach for dealing with this is to use calls to require or library to add the additional packages to the search path. This has two disadvantages. First, it is again subject to shadowing. Second, the fact that the implementation of a package A uses package B does not mean that users who add A to their search path are interested in having B on their search path as well. Name spaces provide a mechanism for specifying package dependencies by importing exported variables from other packages. Formally importing variables from other packages again ensures that these cannot be shadowed by definitions in the dynamic global environment.

From a user's point of view, packages with a name space are much like any other package. A call to library is used to load the package and place its exported variables in the search path. If the package imports definitions from other packages, then these packages will be loaded but they will not be placed on the search path as a result of this load.

Adding a name space to a package

A package is given a name space by placing a 'NAMESPACE' file at the top level of the package source directory. This file contains name space directives that define the name space. This declarative approach to specifying name space information allows tools that collect package information to extract the package interface and dependencies without processing the package source code. Using a separate file also has the benefit of making it easy to add a name space to, and remove a name space from, a package.

The main name space directives are export and import directives. The export directive specifies names of variables to export. For example, the directive

export(as.stepfun, ecdf, is.stepfun, stepfun)

exports four variables. Quoting is optional for standard names such as these, but non-standard names need to be quoted, as in


export("[<-.tracker")

As a convenience, exports can also be specified with a pattern. The directive

exportPattern("\\.test$")

exports all variables ending in .test.

Import directives are used to import definitions from other packages with name spaces. The directive

import(mva)

imports all exported variables from the package mva. The directive

importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.

Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.
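Taken together, the directives for a small hypothetical package (all names below are invented for illustration) might be collected in a 'NAMESPACE' file such as:

useDynLib(mypkg)                 # load the package's DLL when the package is loaded
import(mva)                      # all exported variables of mva
importFrom(stepfun, as.stepfun)  # a single function from stepfun
export(myfit, myplot)            # the public interface
exportPattern("\\.test$")        # plus everything ending in .test
S3method(print, myfit)           # register an S3 print method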

For a package with a name space, the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls and useDynLib directives can replace direct calls to library.dynam.
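A minimal sketch of such hook functions (the package name and option used here are invented; both functions receive the library path and the package name):

.onLoad <- function(libname, pkgname) {
    ## run when the name space is loaded; not exported
    options(mypkg.verbose = FALSE)
}

.onAttach <- function(libname, pkgname) {
    ## run only when the package is attached to the search path
    cat("mypkg loaded\n")
}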

Name spaces are sealed. This means that once a package with a name space is loaded, it is not possible to add or remove variables or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).

Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions, such as fixInNamespace, that may help.

Name spaces and method dispatch

R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3 style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.

S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form

S3method(print, dendrogram)

in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.
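In the package's R code the method itself is just an ordinary, unexported function; only the registration above makes it visible to the dispatch mechanism. A schematic sketch (with a made-up body):

print.dendrogram <- function(x, ...) {
    ## placeholder body, for illustration only
    cat("A dendrogram object\n")
    invisible(x)
}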

S4 classes and generic functions are by design more formally structured than their S3 counterparts, and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.

A simple example

A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is

x <- 1
f <- function(y) c(x, y)


The 'NAMESPACE' file contains the single directive

export(f)

Thus the variable x is private to the package, and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package bar has code file 'bar.R' containing the definitions

c <- function(...) sum(...)
g <- function(y) f(c(y, 7))

and 'NAMESPACE' file

import(foo)

export(g)

The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.

Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded, but not attached to the search path. A call to g produces

> g(6)
[1]  1 13

This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.

Some details

This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.

Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.

Figure 2: A name space environment: static frames holding the internal definitions, the imports, and base, followed by the dynamic global environment (.GlobalEnv, package:ctest, ..., package:base).

When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame (the second frame) of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run, if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages, A and B, have been loaded by explicit calls to library, and package C has been loaded to satisfy import directives in package B.

Figure 3: Internal implementation: the global search path (.GlobalEnv, package:A, package:B, ..., package:base) and the registry of loaded name spaces for A, B, and C, each consisting of internal definitions, an imports frame (with B's imports pointing to C's exports), and base.

The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.

The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.
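For example, a new version of a (hypothetical) package mypkg could be picked up during development with:

> detach("package:mypkg")     # drop the exports from the search path
> unloadNamespace("mypkg")    # force the name space to be unloaded
> library(mypkg)              # load and attach the new version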

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from the use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
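For instance, with the foo package of the earlier example installed, its exported function f can be used without attaching foo:

> foo::f(2)     # loads foo's name space if necessary
[1] 1 2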

Discussion

Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful, and the associated additional complexity is warranted in R, is not yet clear.

Bibliography

John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.

Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu


Converting Packages to S4

by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class, either for their own generic functions or for generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete, because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
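For example (the list returned depends, of course, on which packages are attached in your session):

> methods("print")    # names of all visible S3 print methods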

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system, this inheritance is described by assigning the class c("aov", "lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.
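For example, the class attribute of a fit produced by aov carries both names (the data frame and model here are made up):

> fit <- aov(y ~ group, data = mydata)
> class(fit)
[1] "aov" "lm"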

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named ANY and missing can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
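A small sketch of these declarations, using an invented class and generic rather than anything from an actual package:

setClass("track",
         representation(x = "numeric", y = "numeric"))

## create a new generic explicitly ...
setGeneric("trackLength",
           function(object, ...) standardGeneric("trackLength"))

## ... and register a method for objects of class "track"
setMethod("trackLength", signature(object = "track"),
          function(object, ...) {
              ## total distance along the track
              sum(sqrt(diff(object@x)^2 + diff(object@y)^2))
          })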

In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3 but also more flexible, because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form, as sketched below. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
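A schematic version of this arrangement, with invented names and lm standing in for the real model-fitting code:

setGeneric("fitModel",
           function(formula, data, ...) standardGeneric("fitModel"))

## the "collector" method: canonical argument form, does the real work
setMethod("fitModel",
          signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) {
              lm(formula, data = data, ...)   # stand-in for the actual fitting code
          })

## a convenience method: accepts the name of a data frame, rearranges
## the arguments into canonical form, and re-dispatches
setMethod("fitModel",
          signature(formula = "formula", data = "character"),
          function(formula, data, ...) {
              data <- get(data)
              callGeneric(formula, data, ...)
          })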

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are single nucleotide polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLPs), also known as microsatellite DNA. SSLPs are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and testing for it is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist with the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F )

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model, in which case it is treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag indicating whether the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype", "factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
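For instance, inspecting the object g1 created above shows the factor-based representation (the exact ordering of the levels may differ):

> class(g1)
[1] "genotype" "factor"
> levels(g1)
[1] "A/A" "A/C" "C/C"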

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  alleles.x <- attr(x, "allele.map")
  alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ( [, [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele: Extracts individual alleles (as a matrix).

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.

allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg equilibrium:

diseq: Estimate or compute confidence intervals for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Compute pairwise linkage disequilibrium between genetic markers.

LDtable: Generate a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.

LDplot: Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Computes the probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>

> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

        -----------------------------------
        Test for Hardy-Weinberg-Equilibrium
        -----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

       C     T
C            0.12
T      0.12

Scaled Disequilibrium for each allele pair (D')

       C     T
C            0.56
T      0.56

Correlation coefficient for each allele pair (r)

       C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121 ( 0.073,  0.159)     0
Overall D'     0.560 ( 0.373,  0.714)     0
Overall r     -0.560 (-0.714, -0.373)     0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63, p-value = 3.463e-08

>

> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                 a1691g   c2249t
c104t   D         -0.01    -0.03
c104t   D'         0.05     1.00
c104t   Corr.     -0.03    -0.21
c104t   X^2        0.16     8.51
c104t   P-value    0.69   0.0035
c104t   n        100      100

a1691g  D                  -0.01
a1691g  D'                  0.31
a1691g  Corr.              -0.08
a1691g  X^2                 1.30
a1691g  P-value             0.25
a1691g  n                 100

> LDtable(ld)   # graphical display

[Figure: LDtable(ld) draws a color-coded table, titled "Linkage Disequilibrium", of Marker 1 (c104t, a1691g) against Marker 2 (a1691g, c2249t); each cell shows (-)D', the number of observations N, and the p-value: -0.0487 (100), p = 0.68780; -0.9978 (100), p = 0.00354; -0.3076 (100), p = 0.25439.]

> # fit a model
> summary(lm( DELTA.BMI ~
+               homozygote(c104t,'C') +
+               allele.count(a1691g, 'G') +
+               c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max
 -2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848
                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76
homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
  '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model
$$
y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \dots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),
$$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \dots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by
$$
v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n',
$$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$
\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\,x_j' C x_j, \qquad j = 2, \dots, p.
$$

It can also be written in the form
$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
$$

where
$$
R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j',
$$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
      UseMethod("vif")

> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
        v1 <- diag(V)[-k]
        v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
        nam <- nam[-k]
      } else {
        v1 <- diag(V)
        v2 <- diag(Vi)
        warning(paste("No intercept term",
                      "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered R_j². This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)       x2        x3        x4        x5
   2540.378 2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that x_j'Cx_j ≤ x_j'x_j, so that always R_j² ≤ R̃_j². Hence the usual critical value R_j² > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      ucstruct <- structure(v1 * v2, names = nam)
      if(k) {
        v1 <- diag(V)[-k]
        v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
        nam <- nam[-k]
        cstruct <- structure(v1 * v2, names = nam)
        return(list(CenteredVIF = cstruct,
                    UncenteredVIF = ucstruct))
      } else {
        warning(paste("No intercept term",
                      "detected. Uncentered VIFs computed."))
        return(list(UncenteredVIF = ucstruct))
      }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
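Under a Bourne-type shell, the setup just described might look as follows (a sketch; only the directory name RCrossBuild and the variable $RCB come from this document):

mkdir RCrossBuild            # empty build area, except for the Makefile
cp Makefile RCrossBuild/
cd RCrossBuild
export RCB=`pwd`             # used by later steps in this document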

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools such as curl or links/elinks can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
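For example, using the geepack package from above (a sketch; the version number is just the one used in this example):

R CMD build geepack                    # produces geepack_0.2-4.tar.gz
mv geepack_0.2-4.tar.gz $RCB/pkgsrc/
cd $RCB && make pkg-geepack_0.2-4      # Windows binary ends up in WinRlibs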

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help, rather than confuse, people cross-compiling R. The Makefile is written in a format similar to shell commands in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
$$ \sum_i \frac{1}{\pi_i} Y_i, $$
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
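As a toy illustration (a sketch, not code from the survey package; the numbers are invented), the Horvitz-Thompson estimator is just an inverse-probability weighted sum:

# y:  observed values for the sampled individuals
# pi: their inclusion probabilities
ht.total <- function(y, pi) sum(y / pi)
ht.total(y = c(3, 5, 2), pi = c(0.01, 0.02, 0.01))   # estimated population total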

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then
$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 $$
where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package, as they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
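Such percentages can also be read off directly by converting the weighted table to row proportions with base R (a sketch, using the table computed above):

tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(prop.table(tab, margin = 1), 2)   # dose distribution within each age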

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities (2.5 million to 5 million people) show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
    dm <- 2*(x - mean)/(2*sd^2)
    ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X (of dimension N × p), which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:
$$ newX_{[i,]} = X_{[i,]}\,\underbrace{\left(I - \bar{x}_{S_k}\left(\bar{x}_{S_k}^T \bar{x}_{S_k}\right)^{-1} \bar{x}_{S_k}^T\right)}_{IP_x} $$

6. The next optimal cluster is found by repeating the process using newX.
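To make steps 2 and 3 concrete, a single serial "shave" might be sketched in R as follows (assumed code, not the S-PLUS implementation of Do and Wen (2002); X is the gene-by-sample matrix):

shave.once <- function(X, prop = 0.10) {
  Xc <- sweep(X, 1, rowMeans(X))              # centre each gene (row)
  eigen.gene <- svd(Xc)$v[, 1]                # leading principal component ("eigen-gene")
  R2 <- as.vector(cor(t(X), eigen.gene))^2    # squared correlation of each gene with it
  X[R2 > quantile(R2, prop), , drop = FALSE]  # drop roughly the lowest 10%
}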


Figure 1 Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
$$ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} $$
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm, without slave-slave communication (a master node connected to slaves 1 through n).

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²_b is derived, and the mean of the R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
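In serial R, one such bootstrap permutation of X can be written in a single line (a sketch; note that apply returns its per-row results as columns, hence the transpose):

Xb <- t(apply(X, 1, sample))   # permute the entries within each row of X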

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IP_x, as defined in step 5 of the gene-shaving algorithm.


The value of IP_x is passed to each child. Each child then applies IP_x to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children.

parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks) {
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives an answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IP_x matrix and the expression matrix X. The first step of this procedure is to broadcast the IP_x matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks) {
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]) {
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) against the size of the expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
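These estimates are easy to reproduce with a small helper function (a sketch, not part of the gene-shaving code):

amdahl <- function(rs, n) 1 / (rs + (1 - rs) / n)  # rs = serial proportion
amdahl(1 - 0.78, n = 2)    # about 1.63
amdahl(1 - 0.78, n = 10)   # about 3.35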

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable-sized expression matrices for the overall algorithm, orthogonalization, and bootstrapping under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against the size of the expression matrix (rows).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
$$ S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}, \qquad S_{2\,\mathrm{nodes}} = \frac{1301.18}{825.34} = 1.6, \qquad S_{16\,\mathrm{nodes}} = \frac{1301.18}{549.59} = 2.4 $$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for the bootstrapping. Although there is a gain with the orthogonalization, here it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN.¹ Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well.²

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file 'Rprofile'.

When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
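A few example calls (an illustrative sketch; the topics and search strings are arbitrary):

?lm                           # same as help("lm")
help("lm", htmlhelp = TRUE)   # HTML help, as described above
help.search("linear model")   # search names, titles and keywords
apropos("^glm")               # objects matching a regular expression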

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names, and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list.⁵ These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions already have been asked - and answered - several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
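(The attach() issue described above can be illustrated with a small, hypothetical sketch; the data frames below are made up and not taken from the book.)

  ## Two data frames sharing the column name 'x'.
  d1 <- data.frame(x = 1:3, y = c(2, 4, 6))
  d2 <- data.frame(x = c(100, 200, 300))

  attach(d1)
  attach(d2)   # R warns that 'x' from d1 is now masked
  mean(x)      # silently uses d2$x, which may not be what was intended

  ## Detaching (or simply quitting the session, as suggested above)
  ## restores a clean search path.
  detach(d2)
  detach(d1)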

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: “Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest” (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, “The computing is presented in S-Plus, but all the examples will also work in the freeware program called R.” Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is “nonstatistician scientists in various fields and students of statistics”. It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package) and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters; I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
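(The deparse(substitute()) idiom mentioned above can be sketched roughly as follows; this is a hypothetical function, not code from the book.)

  ## Use the unevaluated argument as the default plot title, so that
  ## myplot(rnorm(100)) is labelled "rnorm(100)" unless 'main' is supplied.
  myplot <- function(x, main = deparse(substitute(x))) {
    plot(x, main = main)
  }
  myplot(rnorm(100))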

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data and the F test for equality of variances.
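(A tiny, made-up illustration of the kind of material covered in these chapters; not code from the book.)

  ## Normal density, cumulative probability, quantile and random numbers.
  dnorm(1.96); pnorm(1.96); qnorm(0.975); rnorm(5)

  ## A quick numerical and graphical summary of a small sample.
  x <- rnorm(50)
  summary(x)
  qqnorm(x); qqline(x)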

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the “intended audience” population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using “Type II” tests as implemented by the Anova() function in package car. The infamous “Type III” tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
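(A rough illustration of this distinction, not taken from the book; it assumes the car package and its Duncan data set are installed.)

  library(car)              # provides Anova() and the Duncan data

  fit <- lm(prestige ~ income + type, data = Duncan)

  anova(fit)                # sequential ("Type I") sums of squares, base R
  Anova(fit)                # "Type II" tests, the default of car's Anova()
  Anova(fit, type = "III")  # "Type III" tests; see ?Anova for the caveats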

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not “regression-only”.

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough “reminders” about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required (a short sketch follows at the end of this list).

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
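(A brief, hypothetical illustration of the random number generator change above; not part of the original announcement.)

  ## Restore the generators of an earlier R version when exact
  ## reproduction of old simulation results is needed.
  RNGversion("1.6.2")
  set.seed(42)
  old <- runif(3)

  ## Return to the new defaults introduced in R 1.7.0.
  RNGkind("Mersenne-Twister", "Inversion")
  set.seed(42)
  new <- runif(3)    # generally differs from 'old'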

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588; a short sketch is given at the end of this list.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument, allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
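(Three of the smaller additions above lend themselves to a quick sketch; the calls are illustrative only and not taken from the NEWS file.)

  ## Recursive indexing of lists: x[[c(4, 2)]] is x[[4]][[2]].
  x <- list(1, "a", TRUE, list("b", "c"))
  identical(x[[c(4, 2)]], x[[4]][[2]])    # TRUE

  ## mapply(), a multivariate lapply()/sapply().
  mapply(rep, 1:3, 3:1)                   # rep(1,3), rep(2,2), rep(3,1)

  ## capture.output() collects printed output as a character vector.
  out <- capture.output(summary(lm(dist ~ speed, data = cars)))
  head(out, 3)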

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines have moved to R_ext/BLAS.h, which is included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, “Regression Analysis: Theory, Methods and Applications”, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging; sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. The package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book “Visualizing Categorical Data” by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface for creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid itself is not reproduced here; see the URL above.]

Across

9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down

1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: “At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting.”

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



Name Space Management for RLuke Tierney

Introduction

In recent years R has become a major platform for thedevelopment of new data analysis software As thecomplexity of software developed in R and the num-ber of available packages that might be used in con-junction increase the potential for conflicts amongdefinitions increases as well The name space mech-anism added in R 170 provides a framework for ad-dressing this issue

Name spaces allow package authors to controlwhich definitions provided by a package are visibleto a package user and which ones are private andonly available for internal use By default definitionsare private a definition is made public by exportingthe name of the defined variable This allows pack-age authors to create their main functions as com-positions of smaller utility functions that can be in-dependently developed and tested Only the mainfunctions are exported the utility functions used inthe implementation remain private An incrementaldevelopment strategy like this is often recommendedfor high level languages like Rmdasha guideline oftenused is that a function that is too large to fit into asingle editor window can probably be broken downinto smaller units This strategy has been hard to fol-low in developing R packages since it would havelead to packages containing many utility functionsthat complicate the userrsquos view of the main function-ality of a package and that may conflict with defini-tions in other packages

Name spaces also give the package author ex-plicit control over the definitions of global variablesused in package code For example a package thatdefines a function mydnorm as

mydnorm lt- function(z)

1sqrt(2 pi) exp(- z^2 2)

most likely intends exp sqrt and pi to referto the definitions provided by the base packageHowever standard packages define their functionsin the global environment shown in Figure 1

packagebase

GlobalEnv

packagectest

Figure 1 Dynamic global environment

The global environment is dynamic in the sense

that top level assignments and calls to library andattach can insert shadowing definitions ahead of thedefinitions in base If the assignment pi lt- 3 hasbeen made at top level or in an attached packagethen the value returned by mydnorm(0) is not likelyto be the value the author intended Name spacesensure that references to global variables in the basepackage cannot be shadowed by definitions in thedynamic global environment

Some packages also build on functionality pro-vided by other packages For example the packagemodreg uses the function asstepfun from the step-fun package The traditional approach for dealingwith this is to use calls to require or library to addthe additional packages to the search path This hastwo disadvantages First it is again subject to shad-owing Second the fact that the implementation of apackage A uses package B does not mean that userswho add A to their search path are interested in hav-ing B on their search path as well Name spaces pro-vide a mechanism for specifying package dependen-cies by importing exported variables from other pack-ages Formally importing variables form other pack-ages again ensures that these cannot be shadowed bydefinitions in the dynamic global environment

From a userrsquos point of view packages with aname space are much like any other package A callto library is used to load the package and place itsexported variables in the search path If the packageimports definitions from other packages then thesepackages will be loaded but they will not be placedon the search path as a result of this load

Adding a name space to a package

A package is given a name space by placing alsquoNAMESPACErsquo file at the top level of the packagesource directory This file contains name space direc-tives that define the name space This declarativeapproach to specifying name space information al-lows tools that collect package information to extractthe package interface and dependencies without pro-cessing the package source code Using a separatefile also has the benefit of making it easy to add aname space to and remove a name space from a pack-age

The main name space directives are export andimport directives The export directive specifiesnames of variables to export For example the di-rective

export(asstepfun ecdf isstepfun stepfun)

exports four variables Quoting is optional for stan-dard names such as these but non-standard namesneed to be quoted as in

R News ISSN 1609-3631

Vol 31 June 2003 3

export([lt-tracker)

As a convenience exports can also be specified witha pattern The directive

exportPattern(test$)

exports all variables ending with testImport directives are used to import definitions

from other packages with name spaces The direc-tive

import(mva)

imports all exported variables from the package mva. The directive

importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.

Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.
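
Taken together, a complete 'NAMESPACE' file might look like the following sketch. The package name mypkg and the names mainFun and mainFit are invented for illustration; only the import(mva) and importFrom(stepfun, as.stepfun) lines are taken from the examples above.

export(mainFun)
exportPattern("\\.test$")
import(mva)
importFrom(stepfun, as.stepfun)
S3method(print, mainFit)
useDynLib(mypkg)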

For a package with a name space, the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls and useDynLib directives can replace direct calls to library.dynam.
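
As a rough sketch (the option set below is made up; real hook functions would do something more useful), both hooks take the same arguments as .First.lib and are simply defined, unexported, in the package code:

.onLoad <- function(libname, pkgname) {
    ## runs whenever the name space is loaded, even if only to satisfy
    ## an import; 'mypkg.verbose' is an invented option name
    options(mypkg.verbose = FALSE)
}

.onAttach <- function(libname, pkgname) {
    ## runs only when the package is attached to the search path
    cat("mypkg loaded\n")
}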

Name spaces are sealed. This means that once a package with a name space is loaded, it is not possible to add or remove variables or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).

Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions, such as fixInNamespace, that may help.

Name spaces and method dispatch

R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3 style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.

S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form

S3method(print, dendrogram)

in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.

S4 classes and generic functions are by design more formally structured than their S3 counterparts and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.

A simple example

A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is

x <- 1
f <- function(y) c(x, y)


The 'NAMESPACE' file contains the single directive

export(f)

Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package bar has code file 'bar.R' containing the definitions

c <- function(...) sum(...)
g <- function(y) f(c(y, 7))

and 'NAMESPACE' file

import(foo)

export(g)

The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.

Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded, but not attached to the search path. A call to g produces

> g(6)

[1] 1 13

This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.

Some details

This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.

Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains the imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.

[Figure 2: Name space environment: three static frames (internal definitions, imports, base) followed by the dynamic global environment (.GlobalEnv, package:ctest, ..., package:base).]

When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame, of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run, if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages A and B have been loaded by explicit calls to library and package C has been loaded to satisfy import directives in package B.

The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.

The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation loaded name spaces are not unloaded automatically. Detaching a package with a name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.

[Figure 3: Internal implementation: the global search path (.GlobalEnv, package:A, package:B, ..., package:base) and the internal data base of loaded name spaces, each consisting of its internal definitions, imports, and base; package C is loaded only to supply the imports of B.]
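
A small sketch of this behaviour, using the example packages foo and bar of the previous section (assuming both have been installed):

> library(bar)
> search()             # "package:bar" is attached, "package:foo" is not
> loadedNamespaces()   # both "foo" and "bar" are loaded
> detach("package:bar")
> loadedNamespaces()   # still both: detaching does not unload
> unloadNamespace("bar")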

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
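
For example, with the package foo of the previous section installed, its exported function can be used without attaching the package:

> foo::f(2)    # loads foo if necessary and returns c(1, 2)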

Discussion

Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful, and the associated additional complexity is warranted, in R is not yet clear.

Bibliography

John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.

Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu


Converting Packages to S4

by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot, and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience. Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system - at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
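
A minimal sketch of such an explicit definition (the class and slot names below are invented and not taken from any package):

setClass("lmFit",
         representation(coef = "numeric", resid = "numeric"))
setClass("aovFit", contains = "lmFit")   # inheritance is part of the class
## new objects are checked against the class definition
validObject(new("lmFit", coef = c(1, 2), resid = rnorm(5)))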

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a "..." argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
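
Continuing the invented sketch above: declaring a method for an existing function such as summary turns it into an S4 generic with the previous definition as the default method, while setGeneric declares a new generic explicitly, and "missing" can appear in a signature:

setMethod("summary", signature(object = "lmFit"),
          function(object, ...) c(mean.resid = mean(object@resid)))
isGeneric("summary")    # TRUE: summary is now an S4 generic

setGeneric("center", function(x, by, ...) standardGeneric("center"))
setMethod("center", signature(x = "numeric", by = "missing"),
          function(x, by, ...) x - mean(x))
setMethod("center", signature(x = "matrix", by = "numeric"),
          function(x, by, ...) sweep(x, 2, by))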

In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3 but also more flexible because of the ability to match an argument signature, rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
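
The following sketch illustrates this pattern; the generic fitModel and its argument layout are invented for illustration and are not the actual nlme code. One "collector" method with the canonical (wordy) arguments does the fitting, and the remaining methods only rearrange their arguments:

setGeneric("fitModel", function(y, x, weights, ...)
           standardGeneric("fitModel"))

## collector method: canonical argument form, does the actual work
setMethod("fitModel",
          signature(y = "numeric", x = "matrix", weights = "numeric"),
          function(y, x, weights, ...) lm.wfit(x, y, weights))

## convenience methods: rearrange into the canonical form and re-dispatch
setMethod("fitModel",
          signature(y = "numeric", x = "data.frame", weights = "numeric"),
          function(y, x, weights, ...)
              fitModel(y, as.matrix(x), weights, ...))
setMethod("fitModel",
          signature(y = "numeric", x = "matrix", weights = "missing"),
          function(y, x, weights, ...)
              fitModel(y, x, weights = rep(1, length(y)), ...))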

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin–Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'AT' and 'TA' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when considering sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty in representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )

  g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently.

  This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2).

  The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele.

  The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

allele.x <- attrib(x, "allele.map")
alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
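
A short sketch of what this looks like underneath, using the object g1 created earlier (the attribute names are those described above; output is omitted):

> class(g1)                 # c("genotype", "factor")
> levels(g1)                # allele pairs such as "A/A", "A/C", ...
> attr(g1, "allele.names")  # the set of allele names
> attr(g1, "allele.map")    # two-column map from levels to alleles
> unclass(g1)               # without the package: integer codes plus attributes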

The genotype class is accompanied by a full complement of helper methods for standard R operators ( ==, [], [<-, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele: Extracts individual alleles (as a matrix).

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.

allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq: Estimate or compute a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Compute pairwise linkage disequilibrium between genetic markers.

LDtable: Generate a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot: Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T

>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C        0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C        0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI              NA's
Overall D      0.121 ( 0.073,  0.159)       0
Overall D'     0.560 ( 0.373,  0.714)       0
Overall r     -0.560 (-0.714, -0.373)       0

           Contains Zero?
Overall D  NO
Overall D' NO
Overall r  NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

>
> LDtable(ld)   # graphical display

[Figure: output of LDtable(ld), a graphical table of the pairwise linkage disequilibrium among markers c104t, a1691g, and c2249t, showing -D' (with the number of observations) and the p-value for each marker pair, color coded by significance.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848
                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76
homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$ y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$

It can also be written in the form

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    }
    else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
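
As a quick cross-check (a sketch using the simulated data above; exact values vary with the random draw), the centered VIF of x3 can be recovered directly from its auxiliary regression via VIF = 1/(1 - R²):

> r2.x3 <- summary(lm(x3 ~ x2 + x4 + x5))$r.squared
> 1/(1 - r2.x3)    # comparable to the vif.lm(cl.lm) value for x3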

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653.

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \tilde R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)          x2          x3          x4          x5
   2540.378    2510.568    27.92701    24.89071    4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde R_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
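
For completeness, a sketch of computing an uncentered VIF directly from its auxiliary regression (here for x3, with cl.lm from the example above); the intercept column is kept as an ordinary regressor and no new intercept is added:

> X <- model.matrix(cl.lm)
> m3 <- lm(X[, "x3"] ~ X[, colnames(X) != "x3"] - 1)
> r2u <- 1 - sum(resid(m3)^2) / sum(X[, "x3"]^2)
> 1/(1 - r2u)    # uncentered VIF for x3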

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    ucstruct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      cstruct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = cstruct,
                  UncenteredVIF = ucstruct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = ucstruct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. URL http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do

make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance Matrix) that depend on external libraries such as atlas, blas, lapack or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g. individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
$$\sum_i \frac{1}{\pi_i}\, Y_i$$
regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
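
As a minimal illustration (ours, not from the article), the estimator is just a weighted sum; with a hypothetical vector y of sampled values and a vector p of their sampling probabilities it can be computed directly:

y <- c(12, 7, 30, 5)          # hypothetical sampled values
p <- c(0.1, 0.5, 0.05, 0.5)   # probability with which each one was sampled
sum(y / p)                    # Horvitz-Thompson estimate of the population total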

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c and observations by i, then
$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$
where S is the weighted sum, Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g. strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
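
For example (a sketch of ours; the output is not reproduced here):

imdsgn            # print method: clusters per stratum, sampling probabilities, variable names
summary(imdsgn)   # the usual summary method for the design object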

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
        pmin(3, POLCTC)
AGEP          0      1      2       3
   0     473079 411177 655211  276013
   1     162985  72498 365445  863515
   2     144000  84519 126126 1057654
   3      89108  68925 110523 1123405
   4     133902 111098 128061 1069026
   5     137122  31668  19027 1121521
   6     127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
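
To see the percentages quoted above one could, for example, convert the weighted counts to row proportions (a sketch of ours, using a temporary object tab of our own naming):

tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(100 * prop.table(tab, margin = 1))   # row percentages by age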

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
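
A rough sketch of obtaining such starting values (ours; it assumes the response and covariates of Figure 2 live in a data frame dat, which the article does not show):

fit0  <- lm(y ~ x + z, data = dat)          # fit ignoring the survey design
start <- list(coef(fit0), sd(resid(fit0)))  # plausible starting values for mean and sd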

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G. and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663–685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X:
$$\mathrm{new}X[i,] = X[i,]\ \underbrace{\left(I - x_{S_k}\left(x_{S_k}^T x_{S_k}\right)^{-1} x_{S_k}^T\right)}_{IPx}$$

6. The next optimal cluster is found by repeating the process using newX.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}.

In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
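
As a small illustration (ours, not from the article), Amdahl's law is easy to evaluate in R; with the parallel proportion of 0.78 estimated later in the paper it gives values close to the expected speedups quoted there:

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)   # rp = parallel proportion, n = processors
amdahl(0.78, c(2, 10))                              # approximately 1.64 and 3.36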

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children – each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm (a master communicating with slaves 1 to n), without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
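
For instance, a one-line serial sketch of this permutation step (ours; X is the numeric expression matrix):

Xb <- t(apply(X, 1, sample))   # permute within each row; t() restores the row/column layout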

The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children.

parbootstrap <- function(X){
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

parorth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the bootstrapping, orthogonalization and overall computations, under the serial, 2-node and 16-node environments

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
$$S = \frac{T_{serial}}{T_{parallel}}, \qquad S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6, \qquad S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as some "instructions" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file '.Rprofile'.
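
For example (a minimal sketch of ours):

help(mean)                # formatted text help for mean()
?mean                     # the usual shorthand for the same request
options(htmlhelp = TRUE)  # switch to HTML help for the rest of the session, where available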

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
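
For example (a sketch of ours; the search strings are arbitrary):

help.search("linear model")   # search names, titles and keywords of installed documentation
apropos("mean")               # objects on the search path whose names contain "mean"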

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions already have been asked – and answered – several times (focussing on slightly different aspects, hence they are illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and the contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required (a short illustration follows at the end of this section).

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
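As a brief illustration of the RNGversion() item above (this sketch is not part of the official release notes, and the version strings are only examples):

  RNGversion("1.6.2")   # restore the generator defaults of an earlier release
  set.seed(42)
  runif(3)              # draws now match those produced under the old defaults
  RNGversion("1.7.0")   # switch back to Mersenne-Twister / Inversion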

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc.; see the examples at the end of this section. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply(); see the examples at the end of this section.

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
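Two of the items above, recursive list indexing and the new mapply(), are perhaps easiest to grasp from a short example. The following lines are illustrative only and are not taken from the release notes; the list x is an invented object.

  x <- list(a = 1:3, b = list(c = "deep", d = 4))
  x[[c(2, 1)]]            # same as x[[2]][[1]], i.e. "deep"

  mapply(rep, 1:3, 3:1)   # returns list(rep(1,3), rep(2,2), rep(3,1))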

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles: the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin

GRASS Interface between GRASS 50 geographicalinformation system and R based on starting Rfrom within the GRASS environment using val-ues of environment variables set in the GISRCfile Interface examples should be run outsideGRASS others may be run within Wrapperand helper functions are provided for a rangeof R functions to match the interface metadatastructures Interface functions by Roger Bi-vand wrapper and helper functions modifiedfrom various originals by interface author

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 03 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

RSvgDevice A graphics device for R that uses thenew w3org xml standard for Scalable VectorGraphics By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications (Springer). Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina


session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid cannot be reproduced here; see the web page above for a printable version of the puzzle.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


export("[<-.tracker")

As a convenience, exports can also be specified with a pattern. The directive

exportPattern("test$")

exports all variables ending with test.

Import directives are used to import definitions from other packages with name spaces. The directive

import(mva)

imports all exported variables from the package mva. The directive

importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.

Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.

For a package with a name space the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls and useDynLib directives can replace direct calls to library.dynam.
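To tie the directives described so far together, here is a hypothetical sketch of a 'NAMESPACE' file and a load hook; the package name mypkg is invented for illustration and the example is not taken from the article.

  useDynLib(mypkg)                   # load the package's compiled code at load time
  import(mva)
  importFrom(stepfun, as.stepfun)
  export(f, g)

and, in the package's R code, an unexported load hook in place of .First.lib:

  .onLoad <- function(libname, pkgname) {
      # initialization that should happen when the name space is loaded
  }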

Name spaces are sealed. This means that once a package with a name space is loaded, it is not possible to add or remove variables or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).

Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions such as fixInNamespace that may help.

Name spaces and method dispatch

R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3 style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.

S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form

S3method(print, dendrogram)

in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.

S4 classes and generic functions are by design more formally structured than their S3 counterparts and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.

A simple example

A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is

x <- 1

f <- function(y) c(x, y)


The 'NAMESPACE' file contains the single directive

export(f)

Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package bar has code file 'bar.R' containing the definitions

c <- function(...) sum(...)

g <- function(y) f(c(y, 7))

and 'NAMESPACE' file

import(foo)

export(g)

The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.

Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded but not attached to the search path. A call to g produces

> g(6)
[1] 1 13

This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.

Some details

This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.

[Figure 2: Name space environment. Three static frames (internal definitions, imports, base) are followed by the dynamic global environment.]

Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains imported definitions and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.
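The mydnorm function itself is defined in the article's introduction, which is not reproduced here; a sketch of the kind of definition meant, assuming the usual normal density formula, is:

  mydnorm <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)
  # 'pi' is resolved in the base frame of the name space environment,
  # so a definition of pi in the global environment cannot shadow it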

When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods and shared libraries are registered. Finally, the .onLoad hook is run, if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages A and B have been loaded by explicit calls to library and package C has been loaded to satisfy import directives in package B.

The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.

[Figure 3: Internal implementation. The global search path, with the exports of the attached packages A and B, and the internal registry of loaded name spaces A, B and C.]

The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.
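A small usage sketch of the functions just mentioned (illustrative only, reusing the foo example package from the previous section):

  loadedNamespaces()        # names of the currently loaded name spaces
  unloadNamespace("foo")    # force unloading, e.g. before loading a rebuilt version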

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
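For instance, with the foo package from the example above (a sketch, not output shown in the article):

  foo::f(2)    # loads foo if necessary and returns c(1, 2)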

Discussion

Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997) and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.

Bibliography

John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.

Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
[email protected]


Converting Packages to S4
by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use namespaces to hide some of the S3 methods defined in the package.)

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal the class of an object is de-termined by its class attribute which should consistof one or more character strings and methods arefound by combining the name of the generic func-tion with the class of the first argument to the func-tion If a function having this combined name is onthe search path it is assumed to be the appropriatemethod Classes and their contents are not formallydefined in the S3 system - at best there is a ldquogentle-manrsquos agreementrdquo that objects in a class will havecertain structure with certain component names

The informality of S3 classes and methods is con-venient but dangerous There are obvious dangers inthat any R object can be assigned a class say foowithout any attempt to validate the names and typesof components in the object That is there is no guar-antee that an object that claims to have class foois compatible with methods for that class Also amethod is recognized solely by its name so a func-tion named printfoo is assumed to be the methodfor the print generic applied to an object of class fooThis can lead to surprising and unpleasant errors forthe unwary

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed, there were examples in some packages where the class inheritance was not consistently assigned.

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
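The S4 counterparts of the informal example above might look as follows. This is only a hedged sketch with invented class and generic names (it is not code from this article or from nlme):

setClass("lmFit", representation(coef = "numeric", resid = "numeric"))
setClass("aovFit", contains = "lmFit")   # inheritance is part of the class definition

setGeneric("describe", function(object, ...) standardGeneric("describe"))
setMethod("describe", signature(object = "lmFit"),
          function(object, ...)
              cat("a fit with", length(object@coef), "coefficients\n"))

describe(new("aovFit", coef = c(1, 2), resid = rnorm(5)))  # uses the "lmFit" method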

In summary, the S4 system is much more formal regarding classes, generics and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is because we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3, but also more flexible, because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4. As more experience is gained, we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
bates@stat.wisc.edu


The genetics Package
Utilities for handling genetic data

by Gregory R Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when considering sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty in representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

g1 <- genotype( c('A/A','A/C','C/C','C/A',
                  NA,'A/A','A/C','A/C') )

g3 <- genotype( c('A A','A C','C C','C A',
                  '','A A','A C','A C'),
                sep=' ', remove.spaces=F)

• A single vector with a positional separator:

g2 <- genotype( c('AA','AC','CC','CA','',
                  'AA','AC','AC'), sep=1 )

• Two separate vectors:

g4 <- genotype(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C')
      )

• A dataframe or matrix with two columns:

gm <- cbind(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

lm( outcome ~ carrier(genotype.var,'A')
    + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

allele.x <- attr(x, "allele.map")
alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([, [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele: Extracts individual alleles, as a matrix.

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.

allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
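As a small illustration of several of these helpers, here is a hedged sketch with made-up data, assuming the genetics package is installed and attached:

g <- genotype(c('A/A','A/C','C/C','C/A',NA))
allele(g, 1)          # first allele of each observation
homozygote(g)         # TRUE where both alleles agree
carrier(g, 'C')       # TRUE where at least one 'C' is present
allele.count(g, 'C')  # 0, 1, or 2 copies of 'C'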

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq: Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Compute pairwise linkage disequilibrium between genetic markers.

LDtable: Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color-coded by significance.

LDplot: Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]

      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C        0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C        0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap using 1000 samples

            Observed 95% CI             NA's
Overall D      0.121 ( 0.073,  0.159)      0
Overall D'     0.560 ( 0.373,  0.714)      0
Overall r     -0.560 (-0.714, -0.373)      0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

	Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)

> ld   # text display

Pairwise LD
-----------
                 a1691g  c2249t
c104t  D          -0.01   -0.03
c104t  D'          0.05    1.00
c104t  Corr.      -0.03   -0.21
c104t  X^2         0.16    8.51
c104t  P-value     0.69  0.0035
c104t  n            100     100

a1691g D                  -0.01
a1691g D'                  0.31
a1691g Corr.              -0.08
a1691g X^2                 1.30
a1691g P-value             0.25
a1691g n                    100

>
> LDtable(ld)   # graphical display

[Figure: LDtable(ld) produces a color-coded graphical table of pairwise linkage disequilibrium for markers c104t, a1691g and c2249t, showing -D'(N) and the P-value for each marker pair, under the title "Linkage Disequilibrium".]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848

                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176,  Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,  p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model
$$ y = \underset{n \times p}{X}\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat{\beta}_j)$ of the $j$-th element $\hat{\beta}_j$ of the ordinary least squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat{\beta}_j)$ in relation to the variance of $\hat{\beta}_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$ \overline{\mathrm{VIF}}_j = \frac{\mathrm{var}(\hat{\beta}_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat{\beta}_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form
$$ \overline{\mathrm{VIF}}_j = \frac{1}{1 - \overline{R}_j^2}, $$
where
$$ \overline{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j' $$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
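The identity $\overline{\mathrm{VIF}}_j = 1/(1 - \overline{R}_j^2)$ can be checked numerically; the following is a minimal sketch with simulated regressors (not data from this article):

set.seed(42)
x2 <- rnorm(30); x3 <- rnorm(30) + 0.8 * x2
r2 <- summary(lm(x2 ~ x3))$r.squared  # centered R^2 of x2 on the other regressor
1 / (1 - r2)                          # centered VIF of x2 in a model y ~ x2 + x3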

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653.

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered $\overline{R}_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2       x3       x4       x5
   2540.378  2510.568 2.792701 2.489071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $\overline{R}_j^2 \le R_j^2$. Hence, the usual critical value $\overline{R}_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(CenteredVIF = c.struct,
             UncenteredVIF = uc.struct)
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(UncenteredVIF = uc.struct)
    }
  }
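Applied to the earlier example, a call like the following would then report both sets of factors (output omitted; this usage sketch assumes the function as reconstructed above):

> vif.lm(cl.lm)   # returns the centered and the uncentered VIFs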

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF, Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and AJ Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help, rather than confuse, people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R
by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is
$$ \sum_i \frac{1}{\pi_i} Y_i $$
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
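As a tiny numerical illustration of the Horvitz-Thompson idea (made-up data, base R only):

y  <- c(10, 12, 9, 20)        # values for the sampled individuals
pi <- c(0.1, 0.1, 0.5, 0.5)   # their inclusion probabilities
sum(y / pi)                   # estimated population total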

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then
$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 $$
where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psu id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
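A toy design with a finite population correction might be set up as follows. This is only a hedged sketch with made-up data; it assumes a version of the survey package in which fpc, like the other design arguments, is supplied as a formula giving the number of population PSUs per stratum:

samp <- data.frame(stratum = rep(1:2, each = 4),
                   psu     = rep(1:4, each = 2),
                   wt      = rep(c(10, 20), each = 4),
                   Nh      = rep(c(20, 40), each = 4),
                   y       = rnorm(8))
dsgn <- svydesign(id = ~psu, strata = ~stratum, weights = ~wt,
                  data = samp, fpc = ~Nh, nest = TRUE)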

Note that variables to be looked up in the sup-plied data frame are all specified by formulas re-moving the scoping ambiguities in some modellingfunctions

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary methods. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta = I^{-1} U_i$.

4. Pass $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz DG, Thompson DJ (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the N × p matrix X, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R² by bootstrap and calculate Gk = R²k − R*²k, where R*²k is the mean R² from B bootstrap samples of Xk. The statistic Gk is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, xSk, from each row of X:

   newX[i,] = X[i,] (I − xSk (xSkᵀ xSk)⁻¹ xSkᵀ)

   where the bracketed projection term is denoted IPx. (A short R sketch of this step is given after the list.)

6. The next optimal cluster is found by repeating the process using newX.
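A direct, if simplified, way to carry out the sweep in step 5 in R is shown below. It is purely illustrative and assumes the mean gene of the cluster is stored as a numeric vector x.Sk of length p.

xSk  <- matrix(x.Sk, ncol = 1)    # the mean gene as a p x 1 column vector
IPx  <- diag(ncol(X)) - xSk %*% t(xSk) / drop(crossprod(xSk))
newX <- X %*% IPx                 # same as applying IPx to each row of X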

R News ISSN 1609-3631

Vol 31 June 2003 22

Figure 1: Clustering of genes.

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to xSk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

Speedup = 1 / (rs + rp/n)

where rs is the proportion of the program computed in serial, rp is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
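As a quick illustration (not part of the gene-shaving code), Amdahl's law is easy to evaluate directly in R; the parallel proportion 0.78 used here is the value estimated later in the article.

amdahl <- function(rp, n) {
  # rp: parallel proportion of the program, n: number of processors
  1 / ((1 - rp) + rp / n)
}
amdahl(0.78, c(2, 10))   # approximately 1.63 and 3.35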

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children – each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

Figure 2: The master-slave paradigm without slave-slave communication (a Master process connected to Slave 1, Slave 2, Slave 3, ..., Slave n).

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
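In other words, the serial permutation step amounts to something like the following one-line sketch (illustrative only, not the article's exact code):

Xstar <- t(apply(X, 1, sample))   # permute the elements within each row of X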

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²b is derived, and the mean of R²b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.



The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
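Those initial steps look roughly like the following sketch. The slave script file names and the way the daemon is started are assumptions for illustration; the real versions are given by Li and Rossini (2001) and in the full source code.

library(rpvm)
.PVM.start.pvmd()    # start the PVM daemon (if one is not already running)
# spawn R slave processes; the script names below are placeholders
boot.children <- .PVM.spawnR(slave = "bootstrap.R", ntask = 10)
orth.children <- .PVM.spawnR(slave = "orthogonalization.R", ntask = 16)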

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages R*²i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)

  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)

  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)

# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}

# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)

  # initialize the send buffer
  .PVM.initsend()

  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)

  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }

  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()

.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart

# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}

# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest. If these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results (computation time in seconds against size of expression matrix in rows; curves for overall gene-shaving, bootstrapping and orthogonalization).

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as



we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable sized expression matrices for the overall, orthogonalization and bootstrapping computations under the different environments (panels: Bootstrap Comparisons, Orthogonalization Comparisons, Overall Comparisons; computation time in seconds against size of expression matrix in rows; curves for serial, 2 nodes and 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = Tserial / Tparallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl G M (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker D J, Sterling T, Savarese D, Dorband J E, Ranawak U A and Packer C V (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do K-A and Wen S (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R and Sunderam V (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie T, Tibshirani R, Eisen M, Alizadeh A, Levy R, Staudt L, Chan W, Botstein D and Brown P (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie T, Tibshirani R, Eisen M, Brown P, Scherf U, Weinstein J, Alizadeh A, Staudt L and Botstein D (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li M N and Rossini A (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini A, Tierney L and Li N (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney L (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu H (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features such as links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
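For example (a few illustrative calls, not taken from the article):

help("lm")                    # the same as ?lm
help.search("linear model")   # search name, title and keyword entries
apropos("test")               # objects whose names partially match "test"
help.start()                  # HTML index with manuals, FAQs and a search engine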

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is a need for discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html



Bibliography

Hornik K (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables W N and Ripley B D (2000). S Programming. New York: Springer-Verlag.

Venables W N and Ripley B D (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables W N, Smith D M and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing – An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2, base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R". Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge on linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.



The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.



• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.



• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.



• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling 'R -g Tk' (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.



• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP - see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid with numbered squares 1-27 omitted; see the URL above for a printable version.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming, and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



The 'NAMESPACE' file contains the single directive

export(f)

Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package, bar, has code file 'bar.R' containing the definitions

c <- function(...) sum(...)

g <- function(y) f(c(y, 7))

and 'NAMESPACE' file

import(foo)

export(g)

The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.

Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded but not attached to the search path. A call to g produces

> g(6)

[1] 1 13

This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.

Some details

This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.

[Figure 2: Name space environment - the package's internal definitions, imports, and base form the static frames, which are followed by the dynamic global environment.]

Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.
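For concreteness, a minimal sketch of what a mydnorm of this kind might look like (assuming, as the discussion implies, that it computes the standard normal density using the base definition of pi):

  mydnorm <- function(z) exp(-z^2/2) / sqrt(2 * pi)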

When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run, if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages A and B have been loaded by explicit calls to library and package C has been loaded to satisfy import directives in package B.

The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.

[Figure 3: Internal implementation - the global search path (with the exports of packages A and B attached along with package:base and .GlobalEnv) and the internal data base of loaded name spaces for A, B, and C, each consisting of internal definitions, imports, and base.]

The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
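For example, continuing the foo and bar packages above (and assuming, as the output of g(6) suggests, that foo defines x <- 1 and f <- function(y) c(x, y)):

  foo::f(2)   # [1] 1 2, even though foo is loaded but not attached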

Discussion

Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.

Bibliography

John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.

Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu


Converting Packages to S4

by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use namespaces to hide some of the S3 methods defined in the package.)

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system - at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class "foo". This can lead to surprising and unpleasant errors for the unwary.

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a '...' argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
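As a small sketch (the class "track" and the generic "span" are invented for this illustration):

  setClass("track", representation(x = "numeric", y = "numeric"))
  setGeneric("span", function(object, ...) standardGeneric("span"))
  setMethod("span", signature(object = "track"),
            function(object, ...) diff(range(object@x)))
  span(new("track", x = 1:10, y = rnorm(10)))   # returns 9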

In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes and the relationships between the classes more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty in representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

    g1 <- genotype( c('A/A','A/C','C/C','C/A',
                      NA,'A/A','A/C','A/C') )

    g3 <- genotype( c('A A','A C','C C','C A',
                      '','A A','A C','A C'),
                    sep=' ', remove.spaces=F)

• A single vector with a positional separator:

    g2 <- genotype( c('AA','AC','CC','CA','',
                      'AA','AC','AC'), sep=1 )

• Two separate vectors:

    g4 <- genotype(
            c('A','A','C','C','','A','A','A'),
            c('A','C','C','A','','A','C','C')
          )

• A dataframe or matrix with two columns:

    gm <- cbind(
            c('A','A','C','C','','A','A','A'),
            c('A','C','C','A','','A','C','C') )
    g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently.

    This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

    lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

    The function allele.count( gene, allele ) returns the number of copies of the specified allele:

    lm( outcome ~ allele.count(genotype.var,'A') + confounder )

dominant/recessive The effect depends only on the presence or absence of a specific allele.

    The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

    lm( outcome ~ carrier(genotype.var,'A') + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  alleles.x <- attr(x, "allele.map")
  alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele Extracts individual alleles (matrix).

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.


allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium (HWE):

diseq: Estimate or compute confidence intervals for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Computes pairwise linkage disequilibrium between genetic markers.

LDtable: Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.

LDplot: Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome.

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name = "A1691G",
+          type = "SNP",
+          locus.name = "MBP2",
+          chromosome = 9,
+          arm = "q",
+          index.start = 35,
+          bp.start = 1691,
+          relative.to = "intron 1")
[...]
>
> # Look at some of the data
> data[1:5, ]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T

>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:


HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequilibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values:

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121 ( 0.073,  0.159)     0
Overall D'     0.560 ( 0.373,  0.714)     0
Overall r     -0.560 (-0.714, -0.373)     0

            Contains Zero?
Overall D               NO
Overall D'              NO
Overall r               NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)

> ld # text display

Pairwise LD
-----------
                a1691g c2249t
c104t  D         -0.01  -0.03
c104t  D'         0.05   1.00
c104t  Corr.     -0.03  -0.21
c104t  X^2        0.16   8.51
c104t  P-value    0.69 0.0035
c104t  n           100    100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

> LDtable(ld) # graphical display

[Figure: "Linkage Disequilibrium" table produced by LDtable(), showing (-)D' (N) and the P-value for each combination of Marker 1 (c104t, a1691g) and Marker 2 (a1691g, c2249t), color coded by significance.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t, 'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data = data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848

                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76


homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
[email protected]

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation factors for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model
$$ y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p) \;\; (n \times p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$
where
$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
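To make the $1/(1 - R_j^2)$ representation concrete, the centered VIF of a single regressor can be obtained from its auxiliary regression on the remaining regressors; this small sketch is not part of the code discussed below and uses the variables x2, ..., x5 generated in the ad-hoc example of the next section:

aux <- lm(x4 ~ x2 + x3 + x5)        # auxiliary regression of x4 on the others
1 / (1 - summary(aux)$r.squared)    # centered VIF of x4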

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list.

> vif <- function(object, ...)
      UseMethod("vif")

> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1 * x3 + 2 * x4 + 4 * x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ \widetilde R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)          x2          x3          x4          x5
   2540.378    2510.568    27.92701    24.89071    4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde R_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(Centered.VIF = c.struct,
                      Uncentered.VIF = uc.struct))
      }


      else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(list(Uncentered.VIF = uc.struct))
      }
  }
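A call to the extended function then returns both sets of diagnostics at once; a brief usage sketch (the component names are those defined in the function above):

> vif(cl.lm)$Centered.VIF     # the centered VIFs shown earlier
> vif(cl.lm)$Uncentered.VIF   # additionally contains an entry for the intercept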

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
[email protected]

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to be cross-built.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0 the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree, to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, use

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.

Compiling

Now we can cross-compile R

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
[email protected]

A.J. Rossini
University of Washington, USA
[email protected]

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in


structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is
$$ \sum_i \frac{1}{\pi_i} Y_i, $$
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
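As a small illustration (not taken from the article), the Horvitz-Thompson estimate of a total is simply a weighted sum, which can be written in base R with made-up values and inclusion probabilities:

y  <- c(12, 7, 30, 18)         # observed values (hypothetical)
pi <- c(0.1, 0.1, 0.02, 0.05)  # inclusion probabilities (hypothetical)
sum(y / pi)                    # inverse-probability weighted estimate of the total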

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then
$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar Y_s \right) \right)^2, $$
where $S$ is the weighted sum, the $\bar Y_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
    pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
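Because the value returned by svytable() can be treated as an ordinary table of weighted counts, base-R tools can re-express it as row proportions; a small sketch (not from the original article):

tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(prop.table(tab, margin = 1), 2)  # proportion of each age group by dose count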

To examine whether this varies by city size we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, with from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)

svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
    dm <- 2 * (x - mean) / (2 * sd^2)
    ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta = I^{-1} U_i$.

4. Pass the $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
[email protected]


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N \times p}$, which represents expressions of genes in the micro-array. Each row of $X$ is data from a gene; the columns contain samples. $N$ is usually of size $10^3$ and $p$ usually no greater than $10^2$.

2. The leading principal component of $X$ is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of $R^2$ are shaved. This process is repeated to produce successive reduced matrices $X^*_1, \ldots, X^*_N$. There are $[0.9N]$ genes in $X^*_1$, $[0.81N]$ in $X^*_2$, and 1 in $X^*_N$.

4. The optimal cluster size is selected from the $X^*$. The selection process requires that we estimate the distribution of $R^2$ by bootstrap and calculate $G_k = R^2_k - R^{*2}_k$, where $R^{*2}_k$ is the mean $R^2$ from $B$ bootstrap samples of $X_k$. The statistic $G_k$ is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $\bar x_{S_k}$, from each row of $X$:
$$ newX[i,] = X[i,]\, \underbrace{\left( I - \bar x_{S_k} (\bar x_{S_k}^T \bar x_{S_k})^{-1} \bar x_{S_k}^T \right)}_{IPx} $$

6. The next optimal cluster is found by repeating the process using $newX$.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix $X$, and secondly, the orthogonalization of the matrix $X$ with respect to $\bar x_{S_k}$. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
$$ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}, $$
where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion, and $n$ is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
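As a small aside (not part of the gene-shaving code), Amdahl's Law is easy to evaluate in R; with the parallel proportion of 0.78 estimated later in the article, it reproduces the expected speedups quoted below:

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)  # rp = parallel proportion
amdahl(0.78, 2)    # roughly 1.6
amdahl(0.78, 10)   # roughly 3.4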

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

[Diagram: a master process communicating with Slave 1, Slave 2, Slave 3, ..., Slave n.]

Figure 2: The master-slave paradigm, without slave-slave communication

Bootstrapping

Bootstrapping is used to find $R^{*2}$, used in the gap statistic outlined in step 4 of the gene-shaving algorithm to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires $N$ iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of $X$ has the sample function applied to it.
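A one-line sketch of that serial step (an assumed implementation, since the serial code is not listed in the article): permute within each row using apply() and sample(), transposing because apply() returns its results column-wise:

Xb <- t(apply(X, 1, sample))  # permute the elements within each row of X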

The matrix $X$ is transformed into $X^*$ $B$ times by randomly permuting within rows. For each of the $B$ bootstrap samples (indexed by $b$) the statistic $R^2_b$ is derived, and the mean of the $R^2_b$, $b = 1, \ldots, B$, is calculated to represent that $R^2$ which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for $B$ bootstrap samples we spawn $B$ tasks.

Orthogonalization

The master R process distributes a portion of the matrix $X$ to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of $IPx$, as defined in step 5 of the gene-shaving algorithm.


The value of $IPx$ is passed to each child. Each child then applies $IPx$ to each row of $X$ that the master has given it.

There are two ways in which $X$ can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples ($B$), and the matrix is copied to each child ($1, \ldots, B$). Child process $i$ permutes elements within each row of $X$, calculates $R^{*2}$, and returns this to the master process. The master averages the $R^{*2}_i$, $i = 1, \ldots, B$, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated $R^2$ values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of all $R^2$ values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an $R^2$ value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

Orthogonalization

The orthogonalization in the master process is han-dled by the function parorth This function takes asarguments the IPx matrix and the expression matrixX The first step of this procedure is to broadcast theIPx matrix to all slaves by firstly packing it and thensending it with PVMmcast(orthchildren0) Thenext step is to split the expression matrix into por-tions and distribute to the slaves For simplicity thisexample divides the matrix by the number of slavesso row size must be divisible by the number of chil-dren Using even blocks on a homogeneous clusterwill work reasonably well as the load-balancing is-sues outlined in Rossini et al (2003) are more rele-vant to heterogeneous clusters The final step per-formed by the master is to receive these portions


back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure: computation time (seconds) plotted against the size of the expression matrix (rows) for overall gene-shaving, total bootstrapping, and total orthogonalization.]

Figure 3: Serial gene-shaving results

Clearly the bootstrapping (blue) step is quite time-consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure: three panels (Bootstrap Comparisons, Orthogonalization Comparisons, Overall Comparisons) plotting computation time (seconds) against the size of the expression matrix (rows) for the serial, 2-node, and 16-node runs.]

Figure 4: Comparison of computation times for variable-sized expression matrices for overall, orthogonalization, and bootstrapping, under the different environments

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
$$ S = \frac{T_{serial}}{T_{parallel}}, \qquad S_{2\,nodes} = \frac{T_{serial}}{T_{2\,nodes}} = 1.6, \qquad S_{16\,nodes} = \frac{T_{serial}}{T_{16\,nodes}} = 2.4. $$

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an 'instruction' on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this Section are shippedwith R (in directory lsquodocmanualrsquo of a regular Rinstallation) Nevertheless when problems occur be-fore R is installed or the available R version is out-dated recent versions of the manuals are also avail-able from CRAN1 Let me cite the ldquoR Installation andAdministrationrdquo manual by the R Development Core

Team (R Core for short 2003b) as a reference for in-formation on installation and updates not only for Ritself but also for contributed packages

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
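For instance (using "grid" purely as an example of a package that ships additional documents; what is listed depends on the installed version), one might try:

  library(help = "grid")                            # package information
  list.files(system.file("doc", package = "grid"))  # documents shipped in its 'doc' directory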

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. I do not want to advertise any of those books here, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file '.Rprofile'.
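A minimal sketch of these calls, using seq() as an arbitrary example topic (the option shown is the one described above for the R version discussed in this issue):

  help("seq")                 # formatted text help, same as ?seq
  options(htmlhelp = TRUE)    # switch to HTML help for the rest of the session
  help("seq")                 # now displayed in the browser, with hyperlinks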

When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
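For example (the search strings below are arbitrary):

  help.search("linear model")   # search name, title and keyword entries of installed packages
  apropos("^lm")                # objects on the search path whose names match a regular expression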

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or for documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the immediate problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time spent reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003): The R FAQ, version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/, ISBN 3-901167-51-X.

R Development Core Team (2003a): R Data Import/Export. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-53-6.

R Development Core Team (2003b): R Installation and Administration. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-52-8.

R Development Core Team (2003c): R Language Definition. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-56-0.

R Development Core Team (2003d): Writing R Extensions. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000): S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
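The effect is easy to see and to avoid; the following minimal sketch (using the built-in 'cars' data rather than one of the book's data sets) shows an explicit detach(), an alternative to restarting the session:

  attach(cars)     # makes the columns 'speed' and 'dist' visible by name
  mean(speed)
  detach(cars)     # remove them again, keeping the search path clean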

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences between the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author, though, that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
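To illustrate the idiom just mentioned (the function name and its use here are hypothetical, not taken from the book):

  ## use the unevaluated argument as a default plot title
  plot.with.title <- function(x, main = deparse(substitute(x)))
      plot(x, main = main)
  plot.with.title(rnorm(100))   # the title becomes "rnorm(100)"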

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide very nice, brief summaries of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
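As a brief illustration of this distinction (a hypothetical model on a built-in data set, not an example from the book; package car must be installed):

  library(car)
  fit <- lm(mpg ~ wt * factor(cyl), data = mtcars)
  Anova(fit)                  # "Type II" tests, the default
  Anova(fit, type = "III")    # available, but discouraged; results depend on the contrasts in use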

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic; the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.

The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required (a short sketch follows this list).

• Name spaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a name space. All the base packages (except methods and tcltk) have name spaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
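For example, to reproduce simulations that were run under the old defaults, something like the following sketch can be used (the seed value is arbitrary, chosen here only for illustration):

  RNGversion("1.6.2")                       # generators as used up to R 1.6.x
  set.seed(123)
  runif(3)
  RNGversion(as.character(getRversion()))   # back to the current defaults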

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc.; a short example follows at the end of this list. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and an 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression (package 'modreg').

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply(); a short example follows at the end of this list.

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() (package 'mva') now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians (package 'modreg').

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' for try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
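As a minimal sketch of two of the features listed above (recursive list indexing and mapply(); the data are made up for illustration):

  x <- list(1, 2, 3, list(a = "A", b = "B"))
  x[[c(4, 2)]]                              # same as x[[4]][[2]], i.e. "B"
  mapply(function(a, b) a + b, 1:3, 4:6)    # a multivariate lapply(): returns 5 7 9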

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava: Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact: The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLM. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit: Local regression, likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim: Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including a Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object, or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid appears in the printed newsletter.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


Vol 31 June 2003 5

[Figure 3: Internal implementation. The diagram shows the global search path (.GlobalEnv, package:A, package:B, ..., package:base) alongside the loaded name spaces of packages A, B, and base, each consisting of internal definitions, imports, and exports.]

name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from the use of fully qualified references are not visible from the name space directives and may therefore cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
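A minimal illustration of these two facilities (the package name foo and the function f are hypothetical):

  ## call an exported function without attaching the package;
  ## this loads foo's name space if it is not already loaded
  foo::f(1:10)

  ## during development, force the name space to be unloaded
  ## so that a rebuilt version of the package can be loaded
  unloadNamespace("foo")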

Discussion

Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997) and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.

Bibliography

John Barnes. Programming In Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.

Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu


Converting Packages to S4

by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class, either for their own generic functions or for generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete, because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
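For example, in a session with a few commonly used packages attached, one might try something along these lines (the exact output depends on what is attached):

  library(MASS)       # any frequently used package
  methods("print")    # e.g. print.anova, print.lm, print.summary.lm, ...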

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system - at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.
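A minimal sketch of this informality (the class name "foo" and its contents are invented for illustration):

  ## any object can claim to be of class "foo"; nothing is validated
  x <- list(value = 42)
  class(x) <- "foo"

  ## any function whose name matches print.foo is taken to be the method
  print.foo <- function(x, ...) cat("foo with value", x$value, "\n")
  print(x)    # dispatches to print.foo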

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
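As a minimal sketch of these declarations (the class timedValue and the generic shift are invented for illustration):

  ## formal class definition: slot names and classes are fixed up front
  setClass("timedValue",
           representation(value = "numeric", time = "POSIXct"))

  ## an explicit generic, and a method identified by its argument signature
  setGeneric("shift", function(x, by, ...) standardGeneric("shift"))
  setMethod("shift", signature(x = "timedValue", by = "numeric"),
            function(x, by, ...) {
              x@time <- x@time + by
              x
            })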

In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and using these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature, rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered, to our chagrin, that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
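A self-contained sketch of the pattern being described (the class toyFit and the generic fitModel are invented; the method for a data frame plays the role of the collector):

  ## class for the fitted model, with a slot holding the original call
  setClass("toyFit", representation(coef = "numeric", call = "call"))

  setGeneric("fitModel",
             function(formula, data, ...) standardGeneric("fitModel"))

  ## the "collector" method does the actual work
  setMethod("fitModel", signature(formula = "formula", data = "data.frame"),
            function(formula, data, ...)
              new("toyFit", coef = coef(lm(formula, data)),
                  call = match.call()))

  ## a convenience method that merely rearranges its arguments;
  ## reassigning the call slot after callGeneric copies the whole object
  setMethod("fitModel", signature(formula = "formula", data = "list"),
            function(formula, data, ...) {
              value <- callGeneric(formula, as.data.frame(data), ...)
              value@call <- match.call()
              value
            })

  fitModel(y ~ x, list(x = 1:10, y = rnorm(10)))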

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code make S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G) and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'AT' and 'TA' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in determining the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model, in which case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  allele.x <- attr(x, "allele.map")
  allele.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele: Extracts individual alleles as a matrix.

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.


allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Converts appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq: Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Computes pairwise linkage disequilibrium between genetic markers.

LDtable: Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot: Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Computes the probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T

>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121 ( 0.073,  0.159)     0
Overall D'     0.560 ( 0.373,  0.714)     0
Overall r     -0.560 (-0.714, -0.373)     0

            Contains Zero?
Overall D    NO
Overall D'   NO
Overall r    NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                 a1691g c2249t
c104t  D          -0.01  -0.03
c104t  D'          0.05   1.00
c104t  Corr       -0.03  -0.21
c104t  X^2         0.16   8.51
c104t  P-value     0.69 0.0035
c104t  n            100    100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

> LDtable(ld)   # graphical display

[Figure: output of LDtable(ld), a graphical table titled "Linkage Disequilibrium" with Marker 1 (c104t, a1691g) against Marker 2 (a1691g, c2249t), showing -D'(N) and the p-value for each marker pair, color coded by significance.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
   allele.count(a1691g, "G") + c2249t,
   data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848
                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$ y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \dots, x_p) \in \mathbb{R}^{n \times p}, \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat{\beta}_j)$ of the $j$-th element $\hat{\beta}_j$ of the ordinary least squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat{\beta}_j)$ in relation to the variance of $\hat{\beta}_j$ as it would have been when

(a) the nonconstant columns $x_2, \dots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat{\beta}_j)}{v_j} = \sigma^{-2}\, \mathrm{var}(\hat{\beta}_j)\, x_j' C x_j, \qquad j = 2, \dots, p. $$

It can also be written in the form

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j' $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

       x2       x3       x4       x5
 1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653!

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

 (Intercept)       x2       x3       x4       x5
    2540.378 2510.568 2792.701 2489.071 4518.498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(Centered.VIF = c.struct,
                  Uncentered.VIF = uc.struct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(Uncentered.VIF = uc.struct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
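With the extended function above, both sets of factors for the ad-hoc example can be inspected in one call (a usage sketch; the output is omitted here):

  > vif(cl.lm)$Centered.VIF
  > vif(cl.lm)$Uncentered.VIF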

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as described at the end of the last section, use

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do

make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. The make target takes the version number together with the package name.
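For instance, assuming the package source lives in a directory geepack/, something like

  R CMD build geepack

produces geepack_0.2-4.tar.gz, which can then be dropped into the pkgsrc subdirectory and built with the corresponding make pkg- target.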

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries, such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$ \sum_i \frac{1}{\pi_i} Y_i $$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
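As a minimal sketch in R (the values and inclusion probabilities below are made up):

  prob <- c(0.01, 0.02, 0.01)   # inclusion probabilities pi_i
  y    <- c(3, 5, 4)            # observed values Y_i
  sum(y / prob)                 # Horvitz-Thompson estimate of the population total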

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then

$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 $$

where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


  imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                      weights=~WTFA.IM, data=immunize,
                      nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

  imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                      strata=~STRATUM,
                      weights=~WTFA.IM,
                      data=immunize,
                      nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
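For example (a usage sketch rather than output reproduced from the article), the design can be inspected with:

  imdsgn            # brief description of the design
  summary(imdsgn)   # strata sizes, distribution of sampling
                    # probabilities, and the variable names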

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

  svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

  > svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
          pmin(3, POLCTC)
  AGEP         0      1      2       3
     0    473079 411177 655211  276013
     1    162985  72498 365445  863515
     2    144000  84519 126126 1057654
     3     89108  68925 110523 1123405
     4    133902 111098 128061 1069026
     5    137122  31668  19027 1121521
     6    127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

  svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
           design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities (from 2.5 million to 5 million people) do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
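A usage sketch (the model is the first one in Figure 1; the object name m is ours):

  m <- svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP),
              design=imdsgn, family=binomial)
  summary(m)   # coefficients with design-based standard errors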

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


  svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

  svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

  gr <- function(x, mean, sd, log){
      dm <- 2*(x - mean)/(2*sd^2)
      ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
      cbind(dm, ds)
  }

  m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
               formulas=list(mean=y~x+z, sd=~1),
               start=list(c(2,5,0), c(4)),
               log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
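A hedged sketch of that last point, assuming the variables y, x and z of Figure 2 live in a data frame dat underlying the dxi design (both names are ours):

  fit0 <- lm(y ~ x + z, data=dat)    # unweighted fit, ignoring the design
  start.mean <- coef(fit0)           # starting values for the mean formula
  start.sd <- summary(fit0)$sigma    # starting value for sd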

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta_i = I^{-1} U_i$ (a small sketch follows below).

4. Pass $\Delta_i$ and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
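A minimal sketch of step 3 above, assuming the score contributions are held in a matrix scores (one row per observation) and the information matrix in infmat (both names are ours):

  deltabeta <- scores %*% solve(infmat)   # one row of delta-betas per observation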

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N \times p}$, which represents expressions of genes in the micro-array. Each row of $X$ is data from a gene; the columns contain samples. $N$ is usually of size $10^3$ and $p$ usually no greater than $10^2$.

2. The leading principal component of $X$ is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of $R^2$ are shaved. This process is repeated to produce successive reduced matrices $X^*_1, \ldots, X^*_N$. There are $[0.9N]$ genes in $X^*_1$, $[0.81N]$ in $X^*_2$, and 1 in $X^*_N$.

4. The optimal cluster size is selected from $X^*$. The selection process requires that we estimate the distribution of $R^2$ by bootstrap and calculate $G_k = R^2_k - R^{*2}_k$, where $R^{*2}_k$ is the mean $R^2$ from $B$ bootstrap samples of $X_k$. The statistic $G_k$ is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $x_{S_k}$, from each row of $X$:
$$newX_{[i,]} = X_{[i,]} \underbrace{\left( I - x_{S_k} (x_{S_k}^T x_{S_k})^{-1} x_{S_k}^T \right)}_{IPx}$$

6. The next optimal cluster is found by repeating the process using $newX$.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix $X$, and secondly, the orthogonalization of the matrix $X$ with respect to $x_{S_k}$.

In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$

where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion, and $n$ is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
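As a quick aside (not part of the original article), the formula is easy to evaluate in R; with the parallel proportion of 0.78 found later, it reproduces the speedup estimates quoted below:

  # estimated speedup from Amdahl's law: 1/(rs + rp/n), with rs = 1 - rp
  amdahl <- function(rp, n) 1/((1 - rp) + rp/n)
  amdahl(0.78, 2)    # about 1.64
  amdahl(0.78, 10)   # about 3.36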

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm, without slave-slave communication

Bootstrapping

Bootstrapping is used to find $R^{*2}$, used in the gap statistic outlined in step 4 of the gene-shaving algorithm to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires $N$ iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of $X$ has the sample function applied to it.
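A one-line sketch of that serial step (assuming X holds the expression matrix):

  # permute the entries within each row of X; t() restores the orientation
  Xb <- t(apply(X, 1, sample))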

The matrix $X$ is transformed into $X^*$ $B$ times by randomly permuting within rows. For each of the $B$ bootstrap samples (indexed by $b$) the statistic $R^2_b$ is derived, and the mean of $R^2_b$, $b = 1, \ldots, B$, is calculated to represent that $R^2$ which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for $B$ bootstrap samples we spawn $B$ tasks.

Orthogonalization

The master R process distributes a portion of the matrix $X$ to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of $IPx$, as defined in step 5 of the gene-shaving algorithm.


The value of $IPx$ is passed to each child. Each child then applies $IPx$ to each row of $X$ that the master has given it.

There are two ways in which $X$ can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
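A small sketch of the block-splitting idea, with invented sizes:

  nslaves <- 4
  rows <- 1600
  # assign each row index to one of nslaves roughly equal blocks
  blocks <- split(1:rows, rep(1:nslaves, each=ceiling(rows/nslaves))[1:rows])
  sapply(blocks, length)   # block sizes, one block per slave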

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples ($B$), and the matrix is copied to each child ($1, \ldots, B$). Child process $i$ permutes elements within each row of $X$, calculates $R^{*2}$, and returns this to the master process. The master averages $R^{*2}_i$, $i = 1, \ldots, B$, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children:

  parbootstrap <- function(X){
    brsq <- 0
    tasks <- length(bootchildren)
    .PVM.initsend()
    # send a copy of X to everyone
    .PVM.pkdblmat(X)
    .PVM.mcast(bootchildren, 0)
    # get the R squared values back and
    # take the mean
    for(i in 1:tasks){
      .PVM.recv(bootchildren[i], 0)
      rsq <- .PVM.upkdouble(1, 1)
      brsq <- brsq + (rsq/tasks)
    }
    return(brsq)
  }

After the master has sent the matrix to the children, it waits for them to return their calculated $R^2$ values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

  rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all $R^2$ values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an $R^2$ value is calculated and returned to the master. The following code performs these operations in the children tasks:

  # receive the matrix
  .PVM.recv(.PVM.parent(), 0)
  X <- .PVM.upkdblmat()
  Xb <- X
  dd <- dim(X)
  # perform the permutation
  for(i in 1:dd[1]){
    permute <- sample(1:dd[2], replace=F)
    Xb[i,] <- X[i, permute]
  }
  # send it back to the master
  .PVM.initsend()
  .PVM.pkdouble(Dfunct(Xb))
  .PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the $R^2$ calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the $IPx$ matrix and the expression matrix $X$. The first step of this procedure is to broadcast the $IPx$ matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

  parorth <- function(IPx, X){
    rows <- dim(X)[1]
    tasks <- length(orthchildren)
    # intialize the send buffer
    .PVM.initsend()
    # pack the IPx matrix and send it to
    # all children
    .PVM.pkdblmat(IPx)
    .PVM.mcast(orthchildren, 0)
    # divide the X matrix into blocks and
    # send a block to each child
    nrows <- as.integer(rows/tasks)
    for(i in 1:tasks){
      start <- (nrows * (i-1)) + 1
      end <- nrows * i
      if(end > rows)
        end <- rows
      Xpart <- X[start:end,]
      .PVM.pkdblmat(Xpart)
      .PVM.send(orthchildren[i], start)
    }
    # wait for the blocks to be returned
    # and re-construct the matrix
    for(i in 1:tasks){
      .PVM.recv(orthchildren[i], 0)
      rZ <- .PVM.upkdblmat()
      if(i==1)
        Z <- rZ
      else
        Z <- rbind(Z, rZ)
    }
    dimnames(Z) <- dimnames(X)
    Z
  }

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

  # wait for the IPx and X matrices to arrive
  .PVM.recv(.PVM.parent(), 0)
  IPx <- .PVM.upkdblmat()
  .PVM.recv(.PVM.parent(), -1)
  Xpart <- .PVM.upkdblmat()
  Z <- Xpart
  # orthogonalize X
  for(i in 1:dim(Xpart)[1]){
    Z[i,] <- Xpart[i,] %*% IPx
  }
  # send back the new matrix Z
  .PVM.initsend()
  .PVM.pkdblmat(Z)
  .PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size $N$ = 1600, 3200, 6400 and 12800 respectively, with a constant $p$ of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) against the size of the expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size $N$ = 12800 and $p$ = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases ($n$ = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against the size of the expression matrix (rows) for the overall computation, the orthogonalization and the bootstrapping, under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the $N$ = 12800 expression matrix is as follows:

$$S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}, \qquad S_{\mathrm{2\ nodes}} = \frac{1301.18}{825.34} = 1.6, \qquad S_{\mathrm{16\ nodes}} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way for getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
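For instance (a usage sketch, not taken from the article):

  ?seq                         # same as help("seq")
  help("seq", htmlhelp=TRUE)   # HTML help with links, if available
  options(chmhelp=TRUE)        # Windows: Compiled HTML help for the session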

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
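For example (again only a sketch):

  help.search("linear models")   # search names, titles and keywords
  apropos("glm")                 # objects whose names partially match "glm"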

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might be still hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked (and answered) several times, focussing on slightly different aspects, hence illuminating a bigger context.

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, and contrasts for categorical data and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

bull and crossprod() for complex argumentsmake use of BLAS routines and so may bemuch faster on some platforms

bull arima() has coef() logLik() (and hence AIC)and vcov() methods

bull New function asdifftime() for time-intervaldata

bull basename() and dirname() are now vector-ized

bull biplotdefault() mva allows lsquoxlabrsquo andlsquoylabrsquo parameters to be set (without partiallymatching to lsquoxlabsrsquo and lsquoylabsrsquo) (Thanks toUwe Ligges)

bull New function captureoutput() to sendprinted output from an expression to a connec-tion or a text string

bull ccf() (pckage ts) now coerces its x and y argu-ments to class ts

bull chol() and chol2inv() now use LAPACK rou-tines by default

bull asdist() is now idempotent ie works fordist objects

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

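As a small illustration of two of the items above, recursive list indexing and capture.output() (the objects here are made up for the example):

    ## recursive indexing: x[[c(4, 2)]] is shorthand for x[[4]][[2]]
    x <- list(1, "a", TRUE, list(10, 20))
    x[[c(4, 2)]]                    # 20

    ## capture.output() collects printed output as a character vector
    out <- capture.output(summary(1:10))
    out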
Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium, ... By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid omitted: see the printed newsletter or the URL above for the grid layout. The numbers 1-27 in the clues below refer to its numbered squares.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing, communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



Converting Packages to S4

by Douglas Bates

Introduction

R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use namespaces to hide some of the S3 methods defined in the package.)

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.

S3 versus S4 classes and methods

S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system - at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.

Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.

There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.

By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

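As a minimal illustration of such an explicit definition (the class and slot names below are invented for this article, not taken from any particular package):

    ## a small, self-contained sketch of explicit S4 class definitions;
    ## inheritance ('contains') is part of the class definition itself
    setClass("fit",
             representation(coefficients = "numeric", call = "call"))
    setClass("weightedFit",
             representation(weights = "numeric"),
             contains = "fit")

    ## objects are checked against the definition when they are created
    f <- new("fit", coefficients = c(1.2, -0.5), call = quote(fit(y ~ x)))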
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a '...' argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)

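A short sketch of these declarations, with invented names and a two-argument signature (the kind of dispatch S3 cannot offer):

    ## declare a new generic explicitly ...
    setGeneric("describe",
               function(object, to, ...) standardGeneric("describe"))

    ## ... and a method whose signature names the classes of two arguments
    setMethod("describe",
              signature(object = "numeric", to = "character"),
              function(object, to, ...) {
                cat("Writing a numeric summary to", to, "\n")
                invisible(object)
              })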
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes and possible coercion of modes that is common in C code called from R can be bypassed.

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R but not very cleanly. Once we had the R version working satisfactorily we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3 but also more flexible because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

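A toy, self-contained sketch of this pattern (all names invented; this is not the actual nlme code) may make the copying issue easier to see:

    ## a class with a 'call' slot, a generic, a "collector" method that
    ## does the real work, and a convenience method that rearranges its
    ## argument, re-dispatches via callGeneric(), and patches the call
    setClass("toyFit", representation(mean = "numeric", call = "call"))

    setGeneric("toyFit", function(y, ...) standardGeneric("toyFit"))

    ## collector method: performs the actual computation
    setMethod("toyFit", signature(y = "numeric"),
              function(y, ...) new("toyFit", mean = mean(y), call = match.call()))

    ## convenience method: extract the data, call the generic again, then
    ## reassign the 'call' slot; returning 'val' copies the whole object
    setMethod("toyFit", signature(y = "data.frame"),
              function(y, ...) {
                val <- callGeneric(y[[1]], ...)
                val@call <- match.call()
                val
              })

    toyFit(data.frame(x = rnorm(10)))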
Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G) and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty in representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

    g1 <- genotype( c('A/A','A/C','C/C','C/A',
                      NA,'A/A','A/C','A/C') )

    g3 <- genotype( c('A A','A C','C C','C A',
                      '','A A','A C','A C'),
                    sep=' ', remove.spaces=F )

• A single vector with a positional separator:

    g2 <- genotype( c('AA','AC','CC','CA','',
                      'AA','AC','AC'), sep=1 )

• Two separate vectors:

    g4 <- genotype(
            c('A','A','C','C','','A','A','A'),
            c('A','C','C','A','','A','C','C')
          )

• A dataframe or matrix with two columns:

    gm <- cbind(
            c('A','A','C','C','','A','A','A'),
            c('A','C','C','A','','A','C','C') )
    g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass, as sketched below. (See the help page for details.)

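A minimal sketch of this one-pass conversion (the data frame and column names are invented for illustration; by default makeGenotypes() looks for columns whose values look like allele pairs such as "A/C"):

    library(genetics)
    snps <- data.frame(id      = 1:4,
                       marker1 = c("A/A", "A/C", "C/C", "A/C"),
                       marker2 = c("T/T", "T/G", NA,    "G/G"))
    snps <- makeGenotypes(snps)   # marker columns become genotype objects
    summary(snps$marker1)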
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently.

  This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

    lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies of a specific allele (0, 1 or 2).

  The function allele.count( gene, allele ) returns the number of copies of the specified allele:

    lm( outcome ~ allele.count(genotype.var,'A')
        + confounder )

dominant/recessive The effect depends only on the presence or absence of a specific allele.

  The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

    lm( outcome ~ carrier(genotype.var,'A')
        + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level; the columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

allele.x <- attr(x, "allele.map")
allele.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([, [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele Extracts individual alleles, as a matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.


allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
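As a quick illustrative sketch (not part of the original article), these helpers can be applied directly to the g1 vector created earlier; the commented results assume the allele labels 'A' and 'C' used above:

allele(g1, 1)          # first allele of each observation
allele.names(g1)       # "A" "C"
homozygote(g1)         # TRUE for 'A/A' and 'C/C' observations
heterozygote(g1)       # TRUE for 'A/C' and 'C/A' observations
carrier(g1, 'A')       # TRUE where at least one 'A' allele is present
allele.count(g1, 'A')  # number of 'A' alleles (0, 1 or 2; NA if missing)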

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Computes pairwise linkage disequilibrium between genetic markers.

LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.

LDplot Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages and the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome.

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T

>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
   Count Proportion
C    137       0.68
T     63       0.32

Genotype Frequency:
     Count Proportion
C/C     59       0.59
C/T     19       0.19
T/T     22       0.22

>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:


HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
  C  0.12
  T        0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
  C  0.56
  T        0.56

Correlation coefficient for each allele pair (r)

        C      T
  C  1.00  -0.56
  T -0.56   1.00

Overall Values

      Value
  D    0.12
  D'   0.56
  r   -0.56

Confidence intervals computed via bootstrap using 1000 samples

              Observed     95% CI           NA's
  Overall D      0.121  ( 0.073,  0.159)       0
  Overall D'     0.560  ( 0.373,  0.714)       0
  Overall r     -0.560  (-0.714, -0.373)       0

              Contains Zero?
  Overall D                NO
  Overall D'               NO
  Overall r                NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
 in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

>
> LDtable(ld)   # graphical display

[Figure: output of LDtable(ld), a color-coded table titled "Linkage Disequilibrium" with Marker 1 (c104t, a1691g) plotted against Marker 2 (a1691g, c2249t); each cell shows -D' with the number of observations (N) and the corresponding P-value.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max
 -2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                            Estimate Std. Error
(Intercept)                  -0.1807     0.5996
homozygote(c104t, "C")TRUE    1.0203     0.2290
allele.count(a1691g, "G")    -0.0905     0.1175
c2249tT/C                     0.4291     0.6873
c2249tT/T                     0.3476     0.5848
                            t value Pr(>|t|)
(Intercept)                   -0.30     0.76


homozygote(c104t, "C")TRUE     4.46  2.3e-05
allele.count(a1691g, "G")     -0.77     0.44
c2249tT/C                      0.62     0.53
c2249tT/T                      0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),$$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n',$$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\,x_j' C x_j, \qquad j = 2, \ldots, p.$$

It can also be written in the form

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where

$$R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j',$$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which was communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")


> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
        "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity. No harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }


    else {
      warning(paste("No intercept term",
        "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
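With this extended definition in place, a call like the following sketch returns both sets of VIFs for the ad-hoc example above (the list element names follow the function just shown; the printed values are the ones reported earlier):

res <- vif.lm(cl.lm)
res$CenteredVIF    # centered VIFs for x2, ..., x5
res$UncenteredVIF  # uncentered VIFs, including the intercept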

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, to automate the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) that you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the Makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to be cross-built.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR, in which the cross-building of R is carried out, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as described at the end of the last section, use

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz", as usually created by the R CMD build command. The make target name takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help, rather than confuse, people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in


structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i,$$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
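As a toy illustration (not taken from the article), the Horvitz-Thompson estimate of a population total is a one-liner once the sampling probabilities are known; the numbers below are made up:

# observed values and their sampling probabilities (hypothetical)
y  <- c(12, 7, 30, 18)
pi <- c(0.10, 0.10, 0.02, 0.05)

# Horvitz-Thompson estimate of the population total: sum of y_i / pi_i
sum(y / pi)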

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2,$$

where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Survey (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested within strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGE.P + POLCT.C, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGE.P + pmin(3, POLCT.C), design=imdsgn)
      pmin(3, POLCT.C)
AGE.P       0      1      2       3
    0  473079 411177 655211  276013
    1  162985  72498 365445  863515
    2  144000  84519 126126 1057654
    3   89108  68925 110523 1123405
    4  133902 111098 128061 1069026
    5  137122  31668  19027 1121521
    6  127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGE.P + pmin(3, POLCT.C) + MSASIZE.P,
         design=imdsgn)

where MSASIZE.P is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, plus a final category for children who don't live in a metropolitan statistical area. As MSASIZE.P has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, with 2.5 to 5 million people, show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCT.C>=3) ~ factor(AGE.P) + factor(MSASIZE.P), design=imdsgn, family=binomial)

svyglm(I(POLCT.C>=3) ~ factor(AGE.P) + MSASIZE.P, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys: that is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu

R News ISSN 1609-3631

Vol 31 June 2003 21

Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computations outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

$$\mathrm{new}X_{[i,]} = X_{[i,]}\,\underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^{T} \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^{T} \right)}_{IPx}$$

6. The next optimal cluster is found by repeating the process using newX.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}},$$

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and to return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as the slaves are independent of one another.

Figure 2: The master-slave paradigm without slave-slave communication (a master process connected to slaves 1 to n).

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²_b is derived, and the mean of the R²_b, b = 1, ..., B, is calculated to represent the R² which arises by chance only. In RPVM, a task is spawned for each bootstrap sample; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions equals the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)

  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)

  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives an answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of the R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)

# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}

# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by first packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)

  # initialize the send buffer
  .PVM.initsend()

  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)

  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end   <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end,]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }

  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart

# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}

# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time used by the bootstrapping and orthogonalization steps is of interest: if these take a significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800, respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) is plotted against the size of the expression matrix (rows) for the overall gene-shaving run, the bootstrapping step, and the orthogonalization step.]

Clearly the bootstrapping (blue) step is quite time-consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In the full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as in the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable-sized expression matrices, for bootstrapping, orthogonalization, and the overall algorithm, under the serial, 2-node, and 16-node environments. Each panel plots computation time (seconds) against the size of the expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

$$S = \frac{T_{serial}}{T_{parallel}}$$

$$S_{2\ nodes} = \frac{1301.18}{825.34} = 1.6$$

$$S_{16\ nodes} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to more time being spent computing the bootstrap than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the one running the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show a greater performance gain over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different kinds of R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as some instruction on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN.[1] Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well.[2]

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference.

[1] http://CRAN.R-project.org
[2] http://CRAN.R-project.org/faqs.html


"Writing R Extensions" also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available, in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized in certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features such as links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
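
To make these access routes concrete, here is a minimal sketch of a help session (the topics are arbitrary examples, and the HTML and CHM formats only work where they are installed):

  help(mean)                 # text help on a known function
  ?mean                      # the usual shorthand for help(mean)
  help("[[")                 # operators and other special syntax need quotes
  options(htmlhelp = TRUE)   # switch this session to HTML help, if available
  help(mean)                 # now rendered as HTML with links to related topics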

When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see its help page for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.

Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, various other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
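
For example (the search strings are arbitrary):

  help.search("quantile")   # match names, titles and keywords of installed help pages
  apropos("test")           # objects on the search path whose names contain "test"
  help.start()              # browser-based index with manuals, FAQs and a search engine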

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search[3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists[4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list[5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

[3] http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] the archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
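
The clutter the reviewer warns about is easy to reproduce in a short, invented example; detaching after use (or quitting the session, as suggested) avoids it:

  d1 <- data.frame(x = 1:3, y = 4:6)
  d2 <- data.frame(x = 7:9, z = 2:4)
  attach(d1)
  attach(d2)    # d2 now masks d1, so x refers to d2$x
  mean(x)       # 8, perhaps not what was intended
  detach(d2)    # removing d2 from the search path uncovers d1's x again
  detach(d1)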

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book.


Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
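
As a small, hedged sketch of two of these substitutions, using simulated data rather than the book's data sets:

  x <- rnorm(50)
  ks.test(x, "pnorm")              # where the S-Plus scripts call ks.gof()

  d <- data.frame(y = rnorm(30), g = gl(3, 10))
  TukeyHSD(aov(y ~ g, data = d))   # where the S-Plus scripts use multicomp()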

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, the book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset() and transform(), among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
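
A minimal sketch of the deparse(substitute()) idiom mentioned above (the function myplot() is invented for illustration):

  myplot <- function(x, main = deparse(substitute(x))) {
    # the unevaluated caller expression becomes the default plot title
    plot(x, main = main)
  }
  myplot(rnorm(100))   # produces a plot titled "rnorm(100)"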

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations.


Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide very nice, brief summaries of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be used directly in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages; see the help for Startup and for options() for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to. (A short example appears at the end of this list.)

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new function RNGversion() allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
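
As a small illustration of the class() and random number generator changes listed above (a sketch only; the exact output depends on the R version in use):

  m <- matrix(1:4, 2)
  class(m)              # now non-null even without 'methods' attached
  oldClass(m)           # NULL: the old-style class attribute is still unset
  RNGversion("1.6.2")   # reinstate the pre-1.7.0 generators for old results
  set.seed(1)
  runif(2)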

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588; see the short example at the end of this list.)

• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string. (See the short example at the end of this list.)

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply(). (See the short example at the end of this list.)

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization, by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as NA, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
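
To illustrate a few of the new functions mentioned in this list (recursive list indexing, capture.output() and mapply()), here is a short, hedged sketch with made-up data:

  x <- list(a = 1, b = list(c = 2, d = 3))
  x[[c(2, 1)]]                        # recursive indexing: the same as x[[2]][[1]]

  out <- capture.output(summary(rnorm(100)))
  out                                 # the printed summary, now as a character vector

  mapply(rep, 1:3, 3:1)               # applies rep() elementwise: rep(1,3), rep(2,2), rep(3,1)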

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from the versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and the associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger. (A short usage sketch follows at the end of this list.)

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J. R. Leathwick and J. McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J. H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
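
As a tiny, hedged sketch of what the abind package described above does (the matrices are invented):

  library(abind)
  a <- matrix(1:4, 2, 2)
  b <- matrix(5:8, 2, 2)
  abind(a, b, along = 3)   # binds the two 2 x 2 matrices into a 2 x 2 x 2 array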

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid omitted in this text version: a grid with numbered squares 1 to 27.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across


Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


is created and at some points during computation with the object it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.

S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.

S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a "..." argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
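As a minimal sketch (the generic, class, and slot names below are invented for illustration and are not taken from this article), declaring a generic explicitly and then registering a method for a particular signature looks like this:

setGeneric("theta", function(object, ...) standardGeneric("theta"))

setClass("simpleFit", representation(coef = "numeric", call = "call"))

## the signature names the class expected for the 'object' argument
setMethod("theta", signature(object = "simpleFit"),
          function(object, ...) object@coef)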

In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.

Package conversion

Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.

S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.

Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.

Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them) but it may also be related to the fact that S4 classes must be declared explicitly and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.

Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
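For illustration only (the class, generic, routine, and package names below are hypothetical and not from the packages discussed here), an R-level method can hand the whole classed object to compiled code in a single .Call:

setClass("fitDetails", representation(x = "matrix", decomp = "list"))

setGeneric("refit", function(object, ...) standardGeneric("refit"))

## a hypothetical C routine "refit_details" would read and update the
## slots of 'object' via GET_SLOT/SET_SLOT and return an object of the
## same class
setMethod("refit", signature(object = "fitDetails"),
          function(object, ...)
            .Call("refit_details", object, PACKAGE = "mypkg"))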

We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.

We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.

The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature, rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
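The pattern can be sketched as follows (all names are invented; the real model-fitting methods are of course more elaborate):

setGeneric("fitModel",
           function(formula, data, ...) standardGeneric("fitModel"))

## "collector" method: the canonical, wordy form of the arguments,
## where the model is actually fitted
setMethod("fitModel",
          signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) {
            ## ... do the real work here ...
          })

## convenience method: rearrange the arguments into the canonical
## form, then re-dispatch on the generic
setMethod("fitModel",
          signature(formula = "formula", data = "matrix"),
          function(formula, data, ...) {
            data <- as.data.frame(data)
            callGeneric(formula, data, ...)
          })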

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4 and, as more experience is gained, we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin–Madison, USA
Bates@stat.wisc.edu


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variants comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when considering sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty in representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F )

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:


  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )

  g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently.

This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2).

The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele.

The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute "allele.names". A translation table from the factor levels to the names of each of the two alleles is stored in the attribute "allele.map". This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

  allele.x <- attr(x, "allele.map")
  alleles.x <- allele.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele: Extracts individual alleles (matrix).

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.


allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq: Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Compute pairwise linkage disequilibrium between genetic markers.

LDtable: Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot: Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius: Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTABMI c104t a1691g c2249t
1 1127409     0.62   C/C    G/G    T/T
2  246311     1.31   C/C    A/A    T/T
3  295185     0.15   C/C    G/G    T/T
4   34301     0.72   C/T    A/A    T/T
5   96890     0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

  -----------------------------------
  Test for Hardy-Weinberg-Equilibrium
  -----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

       C     T
C           0.12
T    0.12

Scaled Disequilibrium for each allele pair (D')

       C     T
C           0.56
T    0.56

Correlation coefficient for each allele pair (r)

       C      T
C    1.00  -0.56
T   -0.56   1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap using 1000 samples

            Observed   95% CI            NA's
Overall D      0.121   ( 0.073,  0.159)     0
Overall D'     0.560   ( 0.373,  0.714)     0
Overall r     -0.560   (-0.714, -0.373)     0

            Contains Zero
Overall D      NO
Overall D'     NO
Overall r      NO

Significance Test:

  Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTABMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                  a1691g  c2249t
c104t  D           -0.01   -0.03
c104t  D'           0.05    1.00
c104t  Corr        -0.03   -0.21
c104t  X^2          0.16    8.51
c104t  P-value      0.69  0.0035
c104t  n             100     100

a1691g D                   -0.01
a1691g D'                   0.31
a1691g Corr                -0.08
a1691g X^2                  1.30
a1691g P-value              0.25
a1691g n                     100
>

> LDtable(ld)   # graphical display

[Figure: LDtable(ld) draws a colour-coded table titled "Linkage Disequilibrium", with Marker 1 (c104t, a1691g) against Marker 2 (a1691g, c2249t); each cell shows (-)D', the number of observations N, and the p-value for that marker pair.]

> # fit a model
> summary(lm( DELTABMI ~
+   homozygote(c104t,'C') +
+   allele.count(a1691g, 'G') +
+   c2249t, data=data))

Call:
lm(formula = DELTABMI ~ homozygote(c104t, "C") +
   allele.count(a1691g, "G") + c2249t,
   data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848
                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176,  Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,  p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$y = \underset{n\times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \dots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),$$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \dots, x_p$ of $X$ had been centered and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n',$$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \dots, p.$$

It can also be written in the form

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where

$$R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j',$$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
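As an illustration of this definition (not code from the article; x2, x3, x4 stand for the nonconstant columns of X in a model with intercept), the centered VIF of one regressor can be computed directly from the auxiliary regression:

## centered VIF for x2: regress x2 on the remaining regressors
## (the intercept is included by default) and use R_j^2
R2.x2  <- summary(lm(x2 ~ x3 + x4))$r.squared
VIF.x2 <- 1 / (1 - R2.x2)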

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    }
    else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653.

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$\tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  2.792701  2.489071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde{R}_j^2$. Hence, the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and AJ Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i$$

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
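For example (toy numbers, not survey data), the Horvitz–Thompson estimate of a total is just the probability-weighted sum:

## four sampled values and their inclusion probabilities (made up)
y    <- c(12, 7, 30, 18)
prob <- c(0.10, 0.05, 0.20, 0.10)
sum(y / prob)   # estimated population total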

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$

where S is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum s.
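A literal transcription of this formula, purely for illustration (this is not the survey package's internal code): y, prob, and strata are observation-level vectors, and n.s gives the number of clusters per stratum in the order of levels(strata):

var.S <- function(y, prob, strata, n.s) {
  ## weighted stratum means, Ybar_s
  ybar <- tapply(y / prob, strata, sum) / tapply(1 / prob, strata, sum)
  ## squared weighted deviations, summed within each stratum
  dev2 <- tapply(((y - ybar[as.character(strata)]) / prob)^2, strata, sum)
  sum(n.s / (n.s - 1) * dev2)
}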

Standard errors for statistics other than means are developed from a first-order Taylor series expansion, that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
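For example, design-based summary statistics can be requested directly from the design object; the call below uses svymean from the survey package (shown as an illustration; it may not correspond exactly to the function set of the package version described here):

## weighted mean and design-based standard error of one variable
svymean(~AGEP, design = imdsgn)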

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC>=3) ~ factor(AGEP)+factor(MSASIZEP),
       design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3) ~ factor(AGEP)+MSASIZEP,
       design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
        sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
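As a minimal illustration (not from the article; the variable names y, x and z and the data frame df are hypothetical), starting values for the Figure 2 fit could be taken from an unweighted analysis:

unwtd <- lm(y ~ x + z, data = df)      # ignores the survey design
start <- list(coef(unwtd),             # starting values for the mean formula
              sd(resid(unwtd)))        # starting value for sd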

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done; a rough sketch of the general pattern is given after the list below.

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/πi (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆i, or compute them from the information matrix I and score contributions Ui as ∆i = I⁻¹Ui.

4. Pass the ∆i and the cluster, stratum and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
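As an illustration only, the following sketch (not code from the survey package) shows the shape of such a wrapper. The fitting function mymodel() and the extractors get.scores() and get.information() are hypothetical placeholders, the components of the design object that are accessed (prob, variables, strata, cluster, fpc) are assumptions about its internal structure, and the svyCprod() argument order is likewise assumed rather than taken from its documentation.

svymymodel <- function(formula, design, ...){
  pweights <- 1/design$prob                        # step 1: weights 1/pi_i
  fit <- mymodel(formula, data = design$variables, # step 2: evaluate the call
                 weights = pweights, ...)
  U <- get.scores(fit)                             # step 3: score contributions U_i
  db <- U %*% solve(get.information(fit))          #   one-step delta-betas
  fit$variance <- svyCprod(db, design$strata,      # step 4: survey variance
                           design$cluster, design$fpc)
  fit$design <- design                             # step 5: keep a copy of the design
  class(fit) <- c("svymymodel", class(fit))        #   so print/summary can be overridden
  fit
}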

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
[email protected]


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and builds upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps (a brief code sketch follows the list):

1. The process starts with the matrix X (of dimension N × p), which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*1, ..., X*N. There are [0.9N] genes in X*1, [0.81N] in X*2, and 1 in X*N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate Gk = R²k − R*²k, where R*²k is the mean R² from B bootstrap samples of Xk. The statistic Gk is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, xSk, from each row of X:

      newX[i,] = X[i,] · IPx,   where IPx = I − xSk (xSkᵀ xSk)⁻¹ xSkᵀ.

6. The next optimal cluster is found by repeating the process using newX.
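To make steps 2, 3 and 5 concrete, here is a small serial sketch in R (our own illustration, not code from the packages discussed; X is assumed to be the numeric N × p expression matrix, cluster.rows is assumed to hold the row indices of the chosen cluster, and the centering conventions of the published algorithm are glossed over):

## step 2: the "eigen-gene" is the leading principal component of X
eigen.gene <- svd(X)$v[, 1]

## step 3: squared correlation of each gene (row) with the eigen-gene,
## then shave the genes in the lowest 10% of R^2
r2 <- apply(X, 1, function(gene) cor(gene, eigen.gene)^2)
Xstar <- X[r2 > quantile(r2, 0.1), , drop = FALSE]

## step 5: sweep the mean gene of the chosen cluster from every row of X
xSk  <- matrix(colMeans(X[cluster.rows, , drop = FALSE]), ncol = 1)
IPx  <- diag(ncol(X)) - xSk %*% solve(t(xSk) %*% xSk) %*% t(xSk)
newX <- X %*% IPx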


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to xSk.

In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

      Speedup = 1 / (rs + rp/n)

where rs is the proportion of the program computed in serial, rp is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
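As a quick illustration (our own check, not from the article), Amdahl's Law is easy to evaluate in R; the serial proportion used here anticipates the parallel proportion of 0.78 reported in the results section:

amdahl <- function(rs, n) 1/(rs + (1 - rs)/n)  # rs: serial proportion, n: processors
amdahl(0.22, 2)    # roughly 1.6 for two nodes
amdahl(0.22, 10)   # roughly 3.4 for ten processors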

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

[Figure 2 (diagram): a master process connected to slaves 1, 2, 3, ..., n.]

Figure 2: The master-slave paradigm without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
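In a serial sketch (assuming only that X is the expression matrix; the R² calculation itself is omitted), this permutation step is a one-liner:

Xb <- t(apply(X, 1, sample))   # permute the elements within each row of X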

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²b is derived, and the mean of the R²b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*²i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  #send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  #get the R squared values back and
  #take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the child tasks:

#receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
#perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
#send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  #initialize the send buffer
  .PVM.initsend()
  #pack the IPx matrix and send it to
  #all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  #divide the X matrix into blocks and
  #send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  #wait for the blocks to be returned
  #and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

#wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
#orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
#send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are a significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3 (plot): computation time in seconds against size of expression matrix (rows), with curves for overall gene-shaving, bootstrapping and orthogonalization.]

Figure 3: Serial gene-shaving results

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure 4 (plots): three panels, "Bootstrap Comparisons", "Orthogonalization Comparisons" and "Overall Comparisons", each showing computation time in seconds against size of expression matrix (rows) for the serial, 2 node and 16 node runs.]

Figure 4: Comparison of computation times for variable sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

      S = Tserial / Tparallel

      S_2nodes  = 1301.18 / 825.34 = 1.6

      S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html


It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
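For example, a session might use these facilities as follows (the topics searched for are arbitrary):

?help                         # same as help("help")
options(htmlhelp = TRUE)      # switch this session to HTML help
help.search("linear model")   # search name/title/keyword entries
apropos("norm")               # objects partially matching "norm"
help.start()                  # HTML index, manuals, FAQs and search engine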

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions already have been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the asker's immediate problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications


are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations.


Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e. it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g. on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl/) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid is omitted in this text version; the grid contains entries numbered 1 to 27.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.

Pitfalls

S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.

One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric(), reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
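To make the pattern concrete, here is a minimal sketch (the class, generic, and method names are hypothetical and not taken from any of the packages discussed in this article) of one collector method plus one convenience method that rearranges its arguments, delegates via callGeneric(), and then reassigns the call slot:

  ## A minimal sketch; "fitResult" and "fitModel" are made-up names.
  setClass("fitResult",
           representation(coef = "numeric", call = "call"))

  setGeneric("fitModel",
             function(x, y, ...) standardGeneric("fitModel"))

  ## Collector method: does the actual work on the canonical argument form.
  setMethod("fitModel", signature(x = "matrix", y = "numeric"),
            function(x, y, ...) {
                new("fitResult",
                    coef = qr.coef(qr(x), y),
                    call = match.call())
            })

  ## Convenience method: rearranges the arguments into the canonical form,
  ## delegates via callGeneric(), and then overwrites the 'call' slot.  It
  ## is this reassignment that forces the (possibly large) result to be
  ## copied again as the nested method calls unwind.
  setMethod("fitModel", signature(x = "data.frame", y = "numeric"),
            function(x, y, ...) {
                val <- callGeneric(as.matrix(x), y, ...)
                val@call <- match.call()
                val
            })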

Conclusions

Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11–16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin–Madison, USA
[email protected]


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G) and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g. 'A/T'), sometimes they are simply concatenated (e.g. 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )

For simplicity, the functions makeGenotype and makeHaplotype can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently.

  This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies of a specific allele (0, 1 or 2).

  The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive The effect depends only on the presence or absence of a specific allele.

  The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

  allele.x <- attrib(x, "allele.map")
  alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2) as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
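As a brief illustration of this storage scheme, the following sketch (the genotype values are made up) shows how the factor levels and the attributes described above can be inspected directly:

  ## A minimal sketch of the internal representation (example data made up).
  library(genetics)

  g <- genotype(c("A/A", "A/C", "C/C", "C/A", NA))

  class(g)                  # c("genotype", "factor")
  levels(g)                 # factor levels of the form "allele1/allele2"
  attr(g, "allele.names")   # the set of allele names, here "A" and "C"
  attr(g, "allele.map")     # two-column matrix mapping levels to alleles
  allele(g, 1)              # first allele of each observation
  carrier(g, "C")           # TRUE for observations carrying a "C" allele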

The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele Extracts individual alleles (matrix).

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.

allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimate or compute confidence interval for the single marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:

HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values

     Value
D     0.12
D'    0.56
r    -0.56

Confidence intervals computed via bootstrap
using 1000 samples

             Observed 95% CI             NA's
Overall D       0.121  ( 0.073,  0.159)  0
Overall D'      0.560  ( 0.373,  0.714)  0
Overall r      -0.560  (-0.714, -0.373)  0

             Contains Zero?
Overall D    NO
Overall D'   NO
Overall r    NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)

> ld   # text display

Pairwise LD
-----------
                  a1691g  c2249t
c104t   D          -0.01   -0.03
c104t   D'          0.05    1.00
c104t   Corr.      -0.03   -0.21
c104t   X^2         0.16    8.51
c104t   P-value     0.69  0.0035
c104t   n            100     100

a1691g  D                  -0.01
a1691g  D'                  0.31
a1691g  Corr.              -0.08
a1691g  X^2                 1.30
a1691g  P-value             0.25
a1691g  n                    100

>

> LDtable(ld)   # graphical display

[Figure: "Linkage Disequilibrium" table produced by LDtable(), with Marker 1 on the rows and Marker 2 on the columns, color coded by significance. Each cell shows -D' (N) and the P-value: c104t vs. a1691g: -0.0487 (100), P = 0.68780; c104t vs. c2249t: -0.9978 (100), P = 0.00354; a1691g vs. c2249t: -0.3076 (100), P = 0.25439.]

> # fit a model
> summary(lm( DELTA.BMI ~
+               homozygote(c104t,'C') +
+               allele.count(a1691g, 'G') +
+               c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
     allele.count(a1691g, "G") + c2249t,
     data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848
                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76

homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
[email protected]

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$ y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1} X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\, \mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$

It can also be written in the form

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653.

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
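The uncentered factors can also be computed directly from the definition above. The following sketch (a hypothetical helper written for illustration, not part of any package) regresses a chosen column of the model matrix, including the intercept column, on all remaining columns without an additional intercept and then applies $1/(1 - \widetilde{R}_j^2)$:

  ## A minimal sketch (hypothetical helper): uncentered VIF of column j of a
  ## model matrix X, computed from the uncentered coefficient of determination.
  uncentered.vif <- function(X, j) {
    xj  <- X[, j]
    Xj  <- X[, -j, drop = FALSE]
    rss <- sum(residuals(lm(xj ~ Xj - 1))^2)  # x_j' M_j x_j
    R2j <- 1 - rss / sum(xj^2)                # uncentered R^2_j
    1 / (1 - R2j)
  }

  ## For example, for the intercept column of the ad-hoc model above:
  ## X <- model.matrix(cl.lm); uncentered.vif(X, 1)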

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }

The user of such a function should allow greatercritical values for the uncentered than for the cen-tered VIFs

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
[email protected]

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

  make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on this directory is stored in the environment variable $RCB. Make sure the file name is Makefile unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools such as curl or links/elinks can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area and unpacks the cross-compiler tools into it.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR, in which the cross-building of R is carried out, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them separately.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as described at the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as the initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, that this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can simply be installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

The Makefile assumes a tarred and gzipped source for an R package, ending in ".tar.gz", as is usually created by the R CMD build command. The Makefile target name takes the version number together with the package name, as in the examples above.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is \pi_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

    \sum_i \frac{1}{\pi_i} Y_i

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
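As a minimal illustration of the estimator itself (with made-up numbers, not NHIS data), the weighted total and mean can be computed directly:

y    <- c(10, 12, 9)           # observed values Y_i
prob <- c(0.01, 0.05, 0.01)    # inclusion probabilities pi_i
sum(y / prob)                  # Horvitz-Thompson estimate of the population total
sum(y / prob) / sum(1 / prob)  # corresponding estimate of the population mean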

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not, in general, maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used for longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

    \widehat{var}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2

where S is the weighted sum, the \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.
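The following function is a literal transcription of the displayed formula, assuming vectors y and prob (the sampling probabilities) and factors strata and cluster; it is only meant to make the formula concrete and is not the survey package's svyCprod code.

ht.var <- function(y, prob, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    i <- strata == s
    ns <- length(unique(cluster[i]))          # number of clusters in stratum s
    ybar <- weighted.mean(y[i], 1 / prob[i])  # weighted stratum mean
    v <- v + ns / (ns - 1) * sum(((y[i] - ybar) / prob[i])^2)
  }
  v    # estimated variance of the weighted sum S
}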

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse-probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of the top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities found in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
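For example, assuming the design object imdsgn created above, the design can be inspected with (the exact output depends on the version of the survey package):

print(imdsgn)     # design description: strata, PSUs, sampling probabilities
summary(imdsgn)   # a more detailed view, including the data variables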

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, with 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP),
       design = imdsgn, family = binomial)

svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP,
       design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean)/(2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd))/(2 * sd^2)^2 - sqrt(2 * pi)/(sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information-matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/\pi_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas \Delta_i, or compute them from the information matrix I and score contributions U_i as \Delta_i = I^{-1} U_i.

4. Pass the \Delta_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., and Thompson, D.J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663–685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N x p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of the order of 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated, and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved (a plain-R sketch of one such shaving step is given after this list). This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by the bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X:

       newX_{[i,]} = X_{[i,]} \left( I - x_{S_k} (x_{S_k}^T x_{S_k})^{-1} x_{S_k}^T \right)

   where the projection matrix in parentheses is denoted IPx below.

6. The next optimal cluster is found by repeating the process using newX.
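For concreteness, a plain serial sketch of one shaving pass (steps 2 and 3) is shown below; it assumes only a numeric matrix X with genes in rows and samples in columns, and is not the authors' implementation.

shave.once <- function(X, shave.prop = 0.1) {
  Xc <- sweep(X, 1, rowMeans(X))                # centre each gene (row)
  eigen.gene <- svd(Xc, nu = 0, nv = 1)$v[, 1]  # leading principal component ("eigen-gene")
  r2 <- as.vector(cor(t(X), eigen.gene))^2      # squared correlation of each gene with it
  keep <- r2 > quantile(r2, shave.prop)         # shave the genes in the lowest 10% of R^2
  X[keep, , drop = FALSE]
}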


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = \frac{1}{r_s + r_p / n}

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as the slaves are independent of one another.

Figure 2: The master-slave paradigm without slave-slave communication (the master exchanges data with each of slaves 1 to n directly).

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
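In other words, one serial bootstrap replicate amounts to something like the following line (with X being the expression matrix):

Xb <- t(apply(X, 1, sample))   # permute the entries within each row of X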

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of the R^2_b, b = 1, ..., B, is calculated to represent the R^2 which arises by chance only. In RPVM, a task is spawned for each bootstrap sample; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.

The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program; these steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes the elements within each row of X, calculates R*^2, and returns this to the master process. The master averages the R*^2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of the R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends the result back to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400, and 12800, respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) plotted against the size of the expression matrix (rows) for overall gene-shaving, bootstrapping, and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases n = 2, 10 gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
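These estimates are easy to verify; the small helper function below simply evaluates Amdahl's law for the proportions quoted above.

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, c(2, 10))   # roughly 1.6 and 3.4, matching the estimates above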

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against the size of the expression matrix (rows) for bootstrapping, orthogonalization, and the overall algorithm, under the serial, 2-node, and 16-node environments.

The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:

    S = \frac{T_{serial}}{T_{parallel}}

    S_{2 nodes} = \frac{1301.18}{825.34} = 1.6

    S_{16 nodes} = \frac{1301.18}{549.59} = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for the bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show a greater performance gain over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or when the available R version is outdated, recent versions of the manuals are also available from CRAN.¹ Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well.²

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file 'Rprofile'.
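For example, one might type (the topic names here are arbitrary):

help("predict")            # formatted text help on predict()
?predict                   # shorthand for the same call
options(htmlhelp = TRUE)   # switch to HTML help for the rest of the session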

When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
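For instance:

help.search("regression")   # documentation whose name, title or keywords match
apropos("mean")             # objects whose names partially match "mean"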

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names, and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list.⁵ These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models, and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data are loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R was itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in it are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters; I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

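As an illustrative sketch (not part of the original NEWS entry; the version strings are only examples), RNGversion() can be used to switch between generator defaults:

  ## reproduce a simulation that was run under an older R release
  RNGversion("1.6.2")      # restore the generators that were the default then
  set.seed(42)
  old.draws <- runif(5)

  RNGversion("1.7.0")      # back to Mersenne-Twister + Inversion
  set.seed(42)
  new.draws <- runif(5)    # generally differs from old.draws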
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

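For example (a minimal sketch, not from the NEWS file): only the first element of a vector condition is used, and R now points this out:

  cond <- c(TRUE, FALSE)
  if (cond) cat("only cond[1] was consulted\n")   # issues a warning about the vector condition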
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.

• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

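A small sketch of the new recursive indexing:

  x <- list(a = 1:3, b = list(u = "deep", v = "deeper"))
  x[[c(2, 1)]]                            # same as x[[2]][[1]], i.e. "deep"
  identical(x[[c(2, 1)]], x[[2]][[1]])    # TRUE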
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

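For example:

  basename(c("/tmp/a.txt", "/var/log/b.log"))   # "a.txt" "b.log"
  dirname(c("/tmp/a.txt", "/var/log/b.log"))    # "/tmp" "/var/log"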
• biplot.default() (mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

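For instance (an illustrative sketch using a standard data set):

  ## capture printed output as a character vector instead of sending it to the console
  out <- capture.output(summary(lm(dist ~ speed, data = cars)))
  length(out)    # number of captured lines
  out[1:3]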
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

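A sketch of the idiom (the URL is purely illustrative):

  ## read a compressed workspace image through a wrapped connection
  con <- gzcon(url("http://example.org/results.RData"))
  load(con)
  close(con)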
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects (all originating from John Fox's car package).

• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

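For example:

  ## apply rep() to the first, second, ... elements of both vectors,
  ## giving rep(1, 3), rep(2, 2), rep(3, 1)
  mapply(rep, 1:3, 3:1)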
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.

• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

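A small example of the call (margins chosen arbitrarily):

  ## two random 2x2 tables with row sums c(10, 10) and column sums c(15, 5)
  r2dtable(2, c(10, 10), c(15, 5))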
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

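For instance (probabilities chosen arbitrarily):

  rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))   # three random count vectors
  dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))     # probability of one particular outcome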
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

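A minimal sketch of the new argument:

  res <- try(stop("numerical failure"), silent = TRUE)   # error message is not printed
  if (inherits(res, "try-error")) cat("recovered, moving on\n")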
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.

• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP: see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton, ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword/ for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[Crossword grid not reproduced here: a numbered grid with squares 1-27 accompanies the clues below.]

Across

9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down

1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, to applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003: Future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes

Introduction

In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G) and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.

The genetics package

The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist when considering the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:

• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )

  g5 <- genotype( gm )

For simplicity, the functions makeGenotype and makeHaplotype can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table,

  allele.x <- attr(x, "allele.map")
  allele.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele Extracts individual alleles (matrix).

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.

allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

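As a quick sketch of how these functions might be called (the genotypes below are simulated and the allele labels made up, not taken from the article's data):

  library(genetics)
  m1 <- genotype(sample(c("A/A", "A/T", "T/T"), 100, replace = TRUE))
  m2 <- genotype(sample(c("C/C", "C/G", "G/G"), 100, replace = TRUE))
  HWE.test(m1)   # disequilibrium estimates and a test for departure from HWE
  LD(m1, m2)     # pairwise linkage disequilibrium between the two markers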
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome.

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T

>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

gt

gt

gt Check Hardy-Weinberg Equilibrium

gt HWEtest(data$c104t)

-----------------------------------

Test for Hardy-Weinberg-Equilibrium

-----------------------------------

Call


HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T    0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T    0.56

Correlation coefficient for each allele pair (r)

       C      T
C   1.00  -0.56
T  -0.56   1.00

Overall Values

     Value
D     0.12
D'    0.56
r    -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121  ( 0.073,  0.159)    0
Overall D'     0.560  ( 0.373,  0.714)    0
Overall r     -0.560  (-0.714, -0.373)    0

            Contains Zero?
Overall D       NO
Overall D'      NO
Overall r       NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld  # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

> LDtable(ld)  # graphical display

[Graphical display produced by LDtable(ld): a table plot titled "Linkage Disequilibrium", with Marker 1 (c104t, a1691g) against Marker 2 (a1691g, c2249t); each cell shows (-)D'(N) and the P-value.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
     allele.count(a1691g, "G") + c2249t,
     data = data)

Residuals:
     Min       1Q   Median       3Q      Max
 -2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848

                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76


homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
  '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
  freedom
Multiple R-Squared: 0.176
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
  p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

    y = X_{n \times p}\,\beta + \varepsilon, \quad X = (1_n, x_2, \ldots, x_p), \quad \varepsilon \sim (0, \sigma^2 I_n),

to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X'X)⁻¹X'y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when

(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

    v_j = \frac{\sigma^2}{x_j' C x_j}, \quad C = I_n - \frac{1}{n} 1_n 1_n',

where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

    VIF_j = \frac{var(\hat\beta_j)}{v_j} = \sigma^{-2}\, var(\hat\beta_j)\, x_j' C x_j, \quad j = 2, \ldots, p.

It can also be written in the form

    VIF_j = \frac{1}{1 - R_j^2},

where

    R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \quad M_j = I_n - X_j (X_j' X_j)^{-1} X_j'

is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
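As a small illustration of this identity (not from the article; the data below are simulated only for this note), the centered VIF of a regressor can be obtained from the auxiliary regression of that regressor on the remaining regressors:

> # simulated data (hypothetical, for illustration only)
> set.seed(1)
> x2 <- rnorm(100); x3 <- x2 + rnorm(100, 0, 0.5); x4 <- rnorm(100)
> # centered R^2 from regressing x2 on the remaining regressors
> r2 <- summary(lm(x2 ~ x3 + x4))$r.squared
> 1 / (1 - r2)   # centered VIF for x2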

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")


> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained via vif.lm(cl.lm) are

       x2       x3       x4       x5
 1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    \tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}

instead of the centered R²_j. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that x_j'Cx_j ≤ x_j'x_j, so that always R²_j ≤ R̃²_j. Hence, the usual critical value R²_j > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(CenteredVIF = c.struct,
             UncenteredVIF = uc.struct)
    }


    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(UncenteredVIF = uc.struct)
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
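For completeness, a small usage sketch (not part of the original article) of the extended vif.lm above, applied to the model cl.lm from the earlier example; note that the multi-argument return() shown above relies on R's behaviour at the time of writing, and later R versions would require an explicit list():

> vifs <- vif.lm(cl.lm)
> vifs$CenteredVIF     # centered VIFs for x2, ..., x5
> vifs$UncenteredVIF   # uncentered VIFs, including the intercept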

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help, rather than confuse, people cross-compiling R. The Makefile is written in a format similar to shell commands in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

    \sum_i \frac{1}{\pi_i} Y_i

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
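As a toy illustration of this estimator (the numbers are invented for this note, not taken from the article), the estimate is simply a weighted sum with weights 1/π_i:

> y <- c(12, 7, 30, 5)         # sampled values (hypothetical)
> p <- c(0.1, 0.1, 0.05, 0.2)  # inclusion probabilities (hypothetical)
> sum(y / p)                   # Horvitz-Thompson estimate of the total
[1] 815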

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c and observations by i, then

    \widehat{var}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} (Y_{sci} - \bar{Y}_s) \right)^2

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion, that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGE.P+POLCT.C, design=imdsgn)

produces a table of the number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGE.P+pmin(3,POLCT.C), design=imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGE.P+pmin(3,POLCT.C)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCT.C>=3)~factor(AGE.P)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCT.C>=3)~factor(AGE.P)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated, and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

    new X[i,] = X[i,]\, \underbrace{\left(I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T\right)}_{IP_x}

6. The next optimal cluster is found by repeating the process using new X.


[Figure 1: Clustering of genes.]

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = \frac{1}{r_s + \frac{r_p}{n}}

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

[Figure 2: The master-slave paradigm, without slave-slave communication. A single Master process is connected to Slave 1, Slave 2, Slave 3, ..., Slave n.]

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
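A one-line sketch of this serial permutation step (assuming X is the N x p expression matrix; this is not code from the article):

# permute the entries within each row of X; apply() collects its results
# column-wise, so transpose to get back an N x p matrix
Xstar <- t(apply(X, 1, sample))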

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of the R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace=F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # intialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) plotted against the size of the expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.]

Clearly, the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
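As a quick check of these figures, a minimal R sketch of the calculation (using the parallel proportion of 0.78 quoted above; this snippet is not part of the original article):

# Amdahl's law: rp is the parallel proportion, n the number of processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, 2)   # roughly 1.6
amdahl(0.78, 10)  # roughly 3.4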

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable sized expression matrices for overall, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against the size of the expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = \frac{T_{serial}}{T_{parallel}}

    S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6

    S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization, here it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: Getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this Section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who gained some experiences might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It includes also descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in thiscase a function name) is provided by help The ldquordquois the probably most frequently used way for gettinghelp whereas the corresponding function help() israther flexible As the default help is shown as for-matted text This can be changed (by argumentsto help() or for a whole session with options())to eg HTML (htmlhelp=TRUE) if available TheHTML format provides convenient features like linksto other related help topics On Windows I pre-fer to get help in Compiled HTML format by settingoptions(chmhelp=TRUE) in file lsquoRprofilersquo

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
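For example (the search strings are arbitrary):

  help.search("regression")   # matches name, title and keyword entries, fuzzily
  apropos("^lm")              # objects on the search path matching a regular expression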

Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose an R site search³, provided by Jonathan Baron, is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched there.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading the manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and an answer from the voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time spent reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned along the way, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
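To illustrate the reviewer's point (the data frame and its columns are invented here, not taken from the book's scripts), a chapter script essentially works like this, and the final detach() is what the scripts omit:

  worms <- data.frame(Area = c(3.6, 5.1, 2.2), Slope = c(11, 2, 3))
  attach(worms)    # column names become directly visible
  mean(Area)       # instead of mean(worms$Area)
  detach(worms)    # omitted in the scripts; prevents masking problems when
                   # another attached data frame also has an 'Area' column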

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book; many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
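A hedged sketch of the kind of substitutions mentioned above (the data and model are invented for illustration):

  x <- rnorm(50)
  ks.test(x, "pnorm")               # R replacement for S-Plus ks.gof()

  dat <- data.frame(y = rnorm(12), grp = gl(3, 4))
  fit <- aov(y ~ grp, data = dat)
  TukeyHSD(fit)                     # R replacement for S-Plus multicomp()

  library(nlme)                     # mixed-effects models require this package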

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
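A minimal sketch of the deparse(substitute()) idiom mentioned above (the function name is made up):

  myplot <- function(x, main = deparse(substitute(x))) {
      ## the default title becomes the expression the caller supplied for 'x'
      plot(x, main = main)
  }
  myplot(rnorm(100))   # the plot title reads "rnorm(100)"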

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred that the coverage of variable selection in multiple linear regression be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics so briefly without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
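A short, hedged sketch of the Anova() usage described here (the data are invented; Anova() is from John Fox's car package, as stated in the review):

  library(car)
  dat <- data.frame(y = rnorm(30), x = rnorm(30), g = gl(3, 10))
  fit <- lm(y ~ x + g, data = dat)
  Anova(fit)                 # "Type II" tests (the default)
  Anova(fit, type = "III")   # available, but its use is discouraged as noted above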

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava: Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact: The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLM. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit: Local regression, likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim: Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including a Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid is not reproduced here; the clue numbers 1–27 below refer to its numbered squares. A printable grid is available at the URL above.]

Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest on the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


Vol 31 June 2003 10

gm <- cbind(
       c('A','A','C','C','','A','A','A'),
       c('A','C','C','A','','A','C','C') )

g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
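
As a minimal illustration (our own hypothetical data, not from the package documentation), assuming genotype columns are coded as 'allele/allele' strings that makeGenotypes() can detect:

# hypothetical data frame with two genotype-like columns
df <- data.frame(subject = 1:3,
                 snp1 = c("A/A", "A/C", "C/C"),
                 snp2 = c("G/T", "G/G", "T/T"))
df <- makeGenotypes(df)  # convert the genotype-like columns in one pass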

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently.

This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

lm( outcome ~ genotype.var + confounder )

additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

The function allele.count( gene, allele ) returns the number of copies of the specified allele:

lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )

dominant/recessive The effect depends only on the presence or absence of a specific allele.

The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

lm( outcome ~ carrier(genotype.var,'A')
    + confounder )

Implementation

The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

allele.x <- attrib(x,"allele.map")
alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2) as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
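
A small sketch of how this storage can be inspected (assuming the attribute names described above; output comments are illustrative only):

g <- genotype( c('A/A','A/C','C/C') )
class(g)                 # c("genotype", "factor")
levels(g)                # e.g. "A/A" "A/C" "C/C"
attr(g, 'allele.names')  # set of allele names
attr(g, 'allele.map')    # two-column map from levels to individual alleles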

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele Extracts individual alleles as a matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.


allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of Observing All Alleles with a Given Frequency in a Sample of a Specified Size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.

Example

Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C        0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C        0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121 ( 0.073,  0.159)  0
Overall D'     0.560 ( 0.373,  0.714)  0
Overall r     -0.560 (-0.714, -0.373)  0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display
Pairwise LD
-----------
                 a1691g  c2249t
c104t  D          -0.01   -0.03
c104t  D'          0.05    1.00
c104t  Corr.      -0.03   -0.21
c104t  X^2         0.16    8.51
c104t  P-value     0.69  0.0035
c104t  n            100     100

a1691g D                  -0.01
a1691g D'                  0.31
a1691g Corr.              -0.08
a1691g X^2                 1.30
a1691g P-value             0.25
a1691g n                    100
>
> LDtable(ld)   # graphical display

[Figure: graphical "Linkage Disequilibrium" table produced by LDtable(ld), showing (-)D'(N) and P-value for each combination of Marker 1 (c104t, a1691g) and Marker 2 (a1691g, c2249t), colour coded by significance.]

> # fit a model
> summary(lm( DELTA.BMI ~
+   homozygote(c104t,'C') +
+   allele.count(a1691g, 'G') +
+   c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
   allele.count(a1691g, "G") + c2249t,
   data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
    '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
[email protected]

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$ y = \underset{n\times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$

It can also be written in the form

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
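
To make the definition concrete, centered VIFs can also be obtained directly from the auxiliary regressions; the following is a sketch of ours (vif.direct is a hypothetical helper, not part of car or of the function below), assuming the model has an intercept and at least two further regressors:

vif.direct <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # nonconstant columns only
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    R2j <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - R2j)                               # VIF_j = 1/(1 - R_j^2)
  })
  names(vifs) <- colnames(X)
  vifs
}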

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1*v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653.
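
For reference, the scaled condition number in Belsley's sense (columns of X scaled to unit length) can be computed along the following lines; this is our own sketch and, since the example data are random, it will only roughly reproduce the value quoted above:

X  <- model.matrix(cl.lm)
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")  # scale columns to unit length
kappa(Xs, exact = TRUE)                     # ratio of largest to smallest singular value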

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
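
Analogously to the centered case, the uncentered VIFs can be obtained directly from auxiliary regressions that treat the intercept column as just another variable; uvif.direct below is a hypothetical helper of ours, shown only to make the definition concrete:

uvif.direct <- function(model) {
  X <- model.matrix(model)                        # includes the intercept column
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    fit <- lm(X[, j] ~ X[, -j, drop = FALSE] - 1) # other columns, no extra intercept
    sum(X[, j]^2) / sum(residuals(fit)^2)         # x'x / x'Mx = 1/(1 - uncentered R^2)
  })
  names(vifs) <- colnames(X)
  vifs
}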

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also (Belsley, 1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    ucstruct <- structure(v1*v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      cstruct <- structure(v1*v2, names = nam)
      return(CenteredVIF = cstruct,
             UncenteredVIF = ucstruct)
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(UncenteredVIF = ucstruct)
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF: Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
[email protected]

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to be cross-built.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and R source (starting from R-1.7.0, other sources that used to be necessary for cross building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools such as curl or links/elinks can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do

make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with "tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
[email protected]

A.J. Rossini
University of Washington, USA
[email protected]

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$ \sum_i \frac{1}{\pi_i} Y_i $$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
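
A toy sketch of the estimator (hypothetical numbers, not survey data):

pi <- c(0.1, 0.1, 0.05, 0.2)  # sampling probabilities
Y  <- c(12, 15, 9, 20)        # observed values
sum(Y / pi)                   # Horvitz-Thompson estimate of the population total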

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent, unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then

$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 $$

where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Survey (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is

imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGE.P+POLCT.C, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGE.P+pmin(3,POLCT.C),design=imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
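
These percentages can be read off by converting the weighted table to row proportions, for example as in the following sketch (the object returned by svytable is used here like an ordinary table):

tab <- svytable(~AGE.P + pmin(3, POLCT.C), design = imdsgn)
round(prop.table(tab, margin = 1), 2)  # proportion of each age group at each dose count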

To examine whether this varies by city size we could tabulate

svytable(~AGE.P+pmin(3,POLCT.C)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2, cities from 2.5 million to 5 million people, does show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCT.C>=3)~factor(AGE.P)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCT.C>=3)~factor(AGE.P)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta = I^{-1}U_i$.

4. Pass $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G. and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
[email protected]


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N\times p}$, which represents expressions of genes in the micro array. Each row of $X$ is data from a gene, the columns contain samples. $N$ is usually of size $10^3$ and $p$ usually no greater than $10^2$.

2. The leading principal component of $X$ is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of $R^2$ are shaved. This process is repeated to produce successive reduced matrices $X^*_1, \ldots, X^*_N$. There are $[0.9N]$ genes in $X^*_1$, $[0.81N]$ in $X^*_2$, and 1 in $X^*_N$.

4. The optimal cluster size is selected from $X^*$. The selection process requires that we estimate the distribution of $R^2$ by bootstrap and calculate $G_k = R^2_k - R^{*2}_k$, where $R^{*2}_k$ is the mean $R^2$ from $B$ bootstrap samples of $X_k$. The statistic $G_k$ is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $\bar{x}_{S_k}$, from each row of $X$:

$$ \mathrm{new}X[i,] = X[i,]\, \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right)}_{IP_x} $$

6. The next optimal cluster is found by repeating the process using $\mathrm{new}X$.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix $X$, and secondly, the orthogonalization of the matrix $X$ with respect to $\bar{x}_{S_k}$. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} $$

where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion, and $n$ is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
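
For example, Amdahl's Law is easily coded as a small R function (our own helper, not part of the paper's code):

amdahl <- function(rp, n) 1 / ((1 - rp) + rp/n)  # rp = parallel proportion
amdahl(0.78, 2)   # approximately 1.63
amdahl(0.78, 10)  # approximately 3.35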

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

Figure 2: The master-slave paradigm without slave-slave communication (a master process connected to slaves 1, 2, 3, ..., n)

Bootstrapping

Bootstrapping is used to find $R^{*2}$, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires $N$ iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of $X$ has the sample function applied to it.
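
In other words, the serial permutation step amounts to something like the following sketch:

Xb <- t(apply(X, 1, sample))  # permute within each row; apply returns the
                              # permuted rows as columns, hence the t()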

The matrix $X$ is transformed into $X^*$ $B$ times by randomly permuting within rows. For each of the $B$ bootstrap samples (indexed by $b$), the statistic $R^2_b$ is derived, and the mean of $R^2_b$, $b = 1, \ldots, B$, is calculated to represent that $R^2$ which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for $B$ bootstrap samples we spawn $B$ tasks.

Orthogonalization

The master R process distributes a portion of the matrix $X$ to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of $IP_x$, as defined in step 5 of the gene-shaving algorithm.

The value of $IP_x$ is passed to each child. Each child then applies $IP_x$ to each row of $X$ that the master has given it.

There are two ways in which $X$ can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples ($B$), and the matrix is copied to each child ($1, \ldots, B$). Child process $i$ permutes elements within each row of $X$, calculates $R^{*2}$, and returns this to the master process. The master averages $R^{*2}_i$, $i = 1, \ldots, B$, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated $R^2$ values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all $R^2$ values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an $R^2$ value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the $R^2$ calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the $IP_x$ matrix and the expression matrix $X$. The first step of this procedure is to broadcast the $IP_x$ matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # intialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]) {
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest. If these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the bootstrapping, orthogonalization, and overall computations, under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes  = 1301.18 / 825.34 = 1.6
    S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing the bootstrap than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

by Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as instructions on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html


It also includes descriptions on interfacing to C, C++ or Fortran code, and on compiling and linking this code into R.

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
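
For instance, the available documentation of an installed package can be listed as follows (the package name 'ts' is chosen purely as an example):

    ## List the title, index and installed documentation of a package
    library(help = "ts")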

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file 'Rprofile'.
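
A minimal illustration of these facilities (the help topic is arbitrary):

    ## Two equivalent ways to open the documentation for mean()
    ?mean
    help("mean")

    ## Ask for HTML help for the rest of the session
    options(htmlhelp = TRUE)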

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
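
For example (the search strings below are arbitrary):

    ## Search names, titles and keywords of installed documentation
    help.search("linear model")

    ## Find objects whose names match a regular expression
    apropos("^help")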

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence they illuminate a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the asker's immediate problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
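
The point about detaching can be sketched with a small made-up data frame (not one of the book's data sets):

    mydata <- data.frame(x = 1:5, y = rnorm(5))
    attach(mydata)    # makes columns x and y visible by name
    mean(x)
    detach(mydata)    # remove it again to keep the search path clean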

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed.


A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
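
As a rough illustration of two of the substitutions mentioned above, the corresponding R calls look like this (the data are invented):

    ## Kolmogorov-Smirnov test: ks.gof() in S-Plus, ks.test() in R
    ks.test(rnorm(50), "pnorm")

    ## Tukey multiple comparisons: multicomp() in S-Plus, TukeyHSD() in R
    d <- data.frame(y = rnorm(30), g = gl(3, 10))
    TukeyHSD(aov(y ~ g, data = d))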

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R was itself originally written as a teaching tool and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package) and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
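
The deparse(substitute()) idiom mentioned above looks roughly like this in practice (a toy function, not taken from the book):

    plotvec <- function(x, main = deparse(substitute(x))) {
      ## the default title becomes the expression the caller passed in
      plot(x, main = main)
    }
    plotvec(rnorm(10))    # plot titled "rnorm(10)"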

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example.


Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from its target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588; see the short example at the end of this list.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
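
A small illustration of the recursive list indexing mentioned earlier in this list (the list object is made up):

    x <- list(1, "a", TRUE, list(3.14, 2L))
    x[[c(4, 2)]]    # same as x[[4]][[2]]
    ## [1] 2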

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava: Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book.


By Kjetil Halvorsen.

abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact: The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLM. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit: Local regression, likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim: Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid omitted; the clue numbers below refer to its numbered cells (1-27).]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

The genetics Package (continued)

allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimate or compute confidence interval for the single marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
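As a quick illustration of how these pieces fit together, here is a minimal sketch (the genotype values below are made up, and the dotted function names follow the list above):

> library(genetics)
> # a small vector of two-allele genotypes
> g <- genotype(c("C/C", "C/T", "T/T", "C/T",
+                 "C/C", "T/T", "C/T"))
> summary(g)    # allele and genotype frequencies
> HWE.test(g)   # disequilibrium estimates and HWE test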

Example

Here is a partial session using tools from the genotype package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:


HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C      T
C  1.00  -0.56
T -0.56   1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI           NA's
Overall D      0.121 ( 0.073, 0.159)     0
Overall D'     0.560 ( 0.373, 0.714)     0
Overall r     -0.560 (-0.714,-0.373)     0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld  # text display
Pairwise LD
-----------
                  a1691g  c2249t
c104t  D           -0.01   -0.03
c104t  D'           0.05    1.00
c104t  Corr.       -0.03   -0.21
c104t  X^2          0.16    8.51
c104t  P-value      0.69  0.0035
c104t  n             100     100

a1691g D                   -0.01
a1691g D'                   0.31
a1691g Corr.               -0.08
a1691g X^2                  1.30
a1691g P-value              0.25
a1691g n                     100

> LDtable(ld)  # graphical display

[Figure: LDtable(ld) draws a graphical table, "Linkage Disequilibrium", of markers c104t, a1691g and c2249t, showing (-)D' with the number of observations (N) and the p-value for each pair, colour coded by significance.]

> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848

                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76


homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
  '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
[email protected]

Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

  y = X_{n \times p}\,\beta + \varepsilon,  \quad  X = (1_n, x_2, \ldots, x_p),  \quad  \varepsilon \sim (0, \sigma^2 I_n),

to diagnose collinearity is to measure how much the variance var(\hat\beta_j) of the j-th element \hat\beta_j of the ordinary least squares estimator \hat\beta = (X'X)^{-1} X'y is inflated by near linear dependencies of the columns of X. This is done by putting var(\hat\beta_j) in relation to the variance of \hat\beta_j as it would have been when

(a) the nonconstant columns x_2, \ldots, x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

  v_j = \frac{\sigma^2}{x_j' C x_j},  \quad  C = I_n - \frac{1}{n} 1_n 1_n',

where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

  VIF_j = \frac{var(\hat\beta_j)}{v_j} = \sigma^{-2}\, var(\hat\beta_j)\, x_j' C x_j,  \quad  j = 2, \ldots, p.

It can also be written in the form

  VIF_j = \frac{1}{1 - R_j^2},

where

  R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j},  \quad  M_j = I_n - X_j (X_j' X_j)^{-1} X_j',

is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
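To make the definition concrete, the centered VIF of a regressor can be computed directly from the R^2 of the auxiliary regression; a minimal sketch with made-up data:

> set.seed(1)
> x2 <- rnorm(50); x3 <- x2 + rnorm(50); x4 <- rnorm(50)
> r2 <- summary(lm(x2 ~ x3 + x4))$r.squared  # regress x2 on the others
> 1/(1 - r2)                                 # centered VIF for x2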

The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
      UseMethod("vif")
> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained via vif.lm(cl.lm) are:

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

  \tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}

instead of the centered R_j^2. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that x_j' C x_j \le x_j' x_j, so that always R_j^2 \le \tilde{R}_j^2. Hence, the usual critical value R_j^2 > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(CenteredVIF = c.struct,
                 UncenteredVIF = uc.struct)
      } else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(UncenteredVIF = uc.struct)
      }
  }
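Applied to the example model above, this complemented function reports both sets of factors in one call (a sketch, assuming the cl.lm fit and the vif.lm version listed above):

> vif(cl.lm)$CenteredVIF
> vif(cl.lm)$UncenteredVIF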

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
[email protected]

Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

  make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

  make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

  make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

  make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that we needed to worry about them.

Build Linux R if needed

  make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

  make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

  make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.

Compiling

Now we can cross-compile R:

  make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


  make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

  make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries, such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
[email protected]

A.J. Rossini
University of Washington, USA
[email protected]

Analysing Survey Data in R
by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

  If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then

To simplify the operations of a large survey it isroutine to sample in clusters For example a randomsample of towns or counties can be chosen and then

individuals or families sampled from within thoseareas There can be multiple levels of cluster sam-pling eg individuals within families within townsCluster sampling often results in people having anunequal chance of being included for example thesame number of individuals might be surveyed re-gardless of the size of the town

For statistical efficiency and to make the resultsmore convincing it is also usual to stratify the popu-lation and sample a prespecified number of individ-uals or clusters from each stratum ensuring that allstrata are fairly represented Strata for a national sur-vey might divide the country into geographical re-gions and then subdivide into rural and urban En-suring that smaller strata are well represented mayalso involve sampling with unequal probabilities

Finally unequal probabilities might be deliber-ately used to increase the representation of groupsthat are small or particularly important In US healthsurveys there is often interest in the health of thepoor and of racial and ethnic minority groups whomight thus be oversampled

The resulting sample may be very different in


structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is \pi_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

  \sum_i \frac{1}{\pi_i} Y_i

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
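As a minimal sketch of the estimator itself (the inclusion probabilities and values below are made up):

> pi <- c(0.1, 0.1, 0.05, 0.2)  # sampling probabilities
> Y  <- c(12, 15, 9, 20)        # observed values
> sum(Y / pi)                   # Horvitz-Thompson estimate of the total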

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

  \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{ci} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2

where S is the weighted sum, \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion, that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
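For instance, the immunization file could be loaded along these lines (the file name here is made up; read.spss and its to.data.frame argument are from the foreign package):

> library(foreign)
> immunize <- read.spss("immunize.sav",
+                       to.data.frame = TRUE)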

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

  svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation

data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
     pmin(3, POLCTC)
AGEP       0      1      2       3
   0  473079 411177 655211  276013
   1  162985  72498 365445  863515
   2  144000  84519 126126 1057654
   3   89108  68925 110523 1123405
   4  133902 111098 128061 1069026
   5  137122  31668  19027 1121521
   6  127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

  svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP, design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, does show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
    dm <- 2 * (x - mean) / (2 * sd^2)
    ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
          sqrt(2 * pi) / (sd * sqrt(2 * pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2, 5, 0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/\pi_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas \Delta_i, or compute them from the information matrix I and score contributions U_i as \Delta = I^{-1} U_i.

4. Pass \Delta_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz DG, Thompson DJ (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
[email protected]


Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N \times p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene, the columns contain samples; N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, ..., and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \bar{x}_{S_k}, from each row of X:

     newX[i,] = X[i,] \left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right),

   where the bracketed matrix is denoted IPx (a small R sketch of this sweep is given after this list).

6. The next optimal cluster is found by repeating the process using newX.
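A minimal R sketch of the sweep in step 5 (the names X, Sk and IPx are placeholders matching the notation above; Sk is assumed to hold the row indices of the optimal cluster):

xSk  <- colMeans(X[Sk, , drop = FALSE])               # mean gene of the cluster
IPx  <- diag(ncol(X)) - outer(xSk, xSk) / sum(xSk^2)  # I - x (x'x)^{-1} x'
newX <- X %*% IPx                                     # sweep it from every row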


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel. Firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \bar{x}_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

  Speedup = \frac{1}{r_s + \frac{r_p}{n}}

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
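As a quick check of the arithmetic used later in the paper, Amdahl's Law is easy to evaluate directly in R (the 0.78 parallel proportion is the bootstrap fraction reported in the results section):

> amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
> amdahl(0.78, 2)   # about 1.6
> amdahl(0.78, 10)  # about 3.4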

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

Figure 2: The master-slave paradigm without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
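In code, the serial permutation step is essentially one line (a sketch; X is the expression matrix):

> Xb <- t(apply(X, 1, sample))  # permute within each row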

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children:

parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks) {
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

  rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked and a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks) {
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]) {
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest. If these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. The plot shows computation time (seconds) against the size of the expression matrix (rows) for the overall gene-shaving run, the bootstrapping step and the orthogonalization step.

Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization there is not going to have any significant effect on the overall time.

We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation).


It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
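These estimates follow directly from Amdahl's law with parallel proportion P = 0.78; a minimal R sketch (the small differences from the quoted 1.63 and 3.35 come from rounding of P):

  # Amdahl's law: maximum speedup with parallel fraction P on n processors
  amdahl <- function(P, n) 1 / ((1 - P) + P / n)
  amdahl(0.78, c(2, 10))   # approximately 1.64 and 3.36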

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against the size of the expression matrix (rows) for variable-sized expression matrices, shown for the overall algorithm, the orthogonalization and the bootstrapping, under the serial, two-node and sixteen-node environments.

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes  = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for the bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists is given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or when the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use a programming language and environment like R adequately without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html


For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.

When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
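As a small illustration of the facilities just described (the search strings are arbitrary; the htmlhelp argument is the one mentioned above):

  ?median                          # the most common way to open a help page
  help("median", htmlhelp = TRUE)  # the same page as HTML, if available
  help.search("median")            # search names, titles and keywords of installed packages
  apropos("med")                   # find objects by partial name; regular expressions work too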

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and of functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
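The browser-based facilities and the per-package indices mentioned above are reached as follows (the package name is only an example):

  help.start()            # HTML index: manuals, FAQs, installed packages, search engine and keywords
  library(help = "ts")    # index of the functions and data sets documented in a package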

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data are loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book.


Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R) and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

 1. Basics
 2. Probability and distributions
 3. Descriptive statistics and graphics
 4. One- and two-sample tests
 5. Regression and correlation
 6. ANOVA and Kruskal-Wallis
 7. Tabular data
 8. Power and computation of sample sizes
 9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
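The deparse(substitute()) idiom mentioned above can be sketched as follows (myplot is a hypothetical function, not one from the book):

  myplot <- function(x, main = deparse(substitute(x))) {
    # the default title is the unevaluated expression supplied for 'x'
    plot(x, main = main)
  }
  myplot(rnorm(10))   # the plot title reads "rnorm(10)"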

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations.


Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument, allowing casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and PDF() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors as different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and the associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava: Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.


abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact: The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLM. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit: Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim: Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including a Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover the available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid with numbered squares 1-27 omitted.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing, communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

        C     T
  C          0.12
  T    0.12

Scaled Disequlibrium for each allele pair (D')

        C     T
  C          0.56
  T    0.56

Correlation coefficient for each allele pair (r)

        C      T
  C   1.00  -0.56
  T  -0.56   1.00

Overall Values

     Value
  D   0.12
  D'  0.56
  r  -0.56

Confidence intervals computed via bootstrap using 1000 samples

             Observed  95% CI            NA's
Overall D      0.121   ( 0.073,  0.159)    0
Overall D'     0.560   ( 0.373,  0.714)    0
Overall r     -0.560   (-0.714, -0.373)    0

             Contains Zero?
Overall D     NO
Overall D'    NO
Overall r     NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld # text display
Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100

>
> LDtable(ld) # graphical display

[Figure: "Linkage Disequilibrium" plot produced by LDtable(ld), showing pairwise D' (with N) and p-values for markers c104t, a1691g and c2249t.]

> # fit a model
> summary(lm( DELTA.BMI ~
+              homozygote(c104t,'C') +
+              allele.count(a1691g, 'G') +
+              c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
   allele.count(a1691g, "G") + c2249t,
   data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                -0.1807     0.5996   -0.30     0.76
homozygote(c104t, C)TRUE    1.0203     0.2290    4.46  2.3e-05
allele.count(a1691g, G)    -0.0905     0.1175   -0.77     0.44
c2249tT/C                   0.4291     0.6873    0.62     0.53
c2249tT/T                   0.3476     0.5848    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176,  Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,  p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model

$$y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \dots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),$$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \dots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n',$$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\, \mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \dots, p.$$

It can also be written in the form

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where

$$R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j'$$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
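The identity VIF_j = 1/(1 - R_j^2) can be checked numerically with a few lines of R. The following sketch uses simulated data and is purely illustrative; none of the objects below appear elsewhere in this article.

## numerical check of VIF_j = 1/(1 - R_j^2) for x2, via the
## auxiliary regression of x2 on the remaining regressors
set.seed(123)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, 0, 0.5)   # deliberately correlated with x2
x4 <- rnorm(n)
y  <- 1 + x2 + x3 + x4 + rnorm(n)
aux <- lm(x2 ~ x3 + x4)            # auxiliary regression (with intercept)
1 / (1 - summary(aux)$r.squared)   # centered VIF for x2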

The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

vif <- function(object, ...)
  UseMethod("vif")

vif.default <- function(object, ...)
  stop("No default method for vif. Sorry.")

vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  k <- match("(Intercept)", nam,
             nomatch = FALSE)
  if(k) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning(paste("No intercept term",
                  "detected. Results may surprise."))
  }
  structure(v1 * v2, names = nam)
}

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)       x2        x3        x4        x5
   2540.378 2510.568  2.792701  2.489071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.

vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  k <- match("(Intercept)", nam,
             nomatch = FALSE)
  v1 <- diag(V)
  v2 <- diag(Vi)
  uc.struct <- structure(v1 * v2, names = nam)
  if(k) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
    c.struct <- structure(v1 * v2, names = nam)
    return(list(Centered.VIF = c.struct,
                Uncentered.VIF = uc.struct))
  } else {
    warning(paste("No intercept term",
                  "detected. Uncentered VIFs computed."))
    return(list(Uncentered.VIF = uc.struct))
  }
}

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
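For completeness, a call of the extended function on the ad-hoc example above might look as follows; the component names Centered.VIF and Uncentered.VIF simply follow the listing above.

vifs <- vif.lm(cl.lm)
vifs$Centered.VIF     # centered VIFs for x2, ..., x5
vifs$Uncentered.VIF   # uncentered VIFs, including the intercept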

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile); and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools such as curl or links/elinks can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environmental variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i,$$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
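As a minimal illustration in R (the vectors y and p below are hypothetical, holding the sampled values and their sampling probabilities):

# Horvitz-Thompson estimate of the population total
y.total <- sum(y / p)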

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2,$$

where S is the weighted sum, the $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum s.
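A naive, direct transcription of this formula into R might look as follows. It is purely illustrative (the helper name and its arguments are hypothetical, and it ignores the subsetting adjustments handled by the package's svyCprod function mentioned below):

# y: observations, w = 1/pi: sampling weights,
# strata, cluster: design identifiers (one entry per observation)
ht.total.var <- function(y, w, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    idx  <- strata == s
    ns   <- length(unique(cluster[idx]))    # clusters in stratum s
    ybar <- weighted.mean(y[idx], w[idx])   # weighted stratum mean
    v    <- v + ns/(ns - 1) * sum((w[idx] * (y[idx] - ybar))^2)
  }
  v
}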

Standard errors for statistics other than means are developed from a first-order Taylor series expansion, that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psuid in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC>=3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)

svyglm(I(POLCTC>=3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1} U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz D.G., Thompson D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N \times p}$, which represents expressions of genes in the micro-array. Each row of X is data from a gene, the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated, and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate $G_k = R_k^2 - R_k^{*2}$, where $R_k^{*2}$ is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $\bar{x}_{S_k}$, from each row of X (a short serial sketch of this sweep is given after this list):

$$newX_{[i,]} = X_{[i,]}\, \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right)}_{IPx}$$

6. The next optimal cluster is found by repeating the process using newX.
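A minimal serial sketch of the sweep in step 5, under the assumption that X is the N x p expression matrix and xS is the mean gene of the optimal cluster (a vector of length p); these names are illustrative only:

# projection matrix that removes the component along xS
IPx  <- diag(length(xS)) - outer(xS, xS) / sum(xS^2)
# apply it to every row of X at once
newX <- X %*% IPx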


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to $\bar{x}_{S_k}$. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}},$$

where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
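As a small illustration, the expected speedups reported later in the results section (1.63 for two nodes and 3.35 for ten usable processors, with a parallel proportion of about 0.78) can be reproduced directly; the helper below is illustrative and not part of any package:

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(rp = 0.78, n = 2)    # approximately 1.6
amdahl(rp = 0.78, n = 10)   # approximately 3.4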

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

Figure 2: The master-slave paradigm without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
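In serial R the row-wise permutation might be written as a one-liner such as the following sketch (X being the expression matrix):

# permute the entries within each row of X
Xb <- t(apply(X, 1, sample))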

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic $R_b^2$ is derived, and the mean of the $R_b^2$, b = 1, ..., B, is calculated to represent the R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.

The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
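A minimal sketch of the even-block splitting (nslaves, the number of orthogonalization slaves, is a hypothetical name):

# row indices grouped into roughly equal consecutive blocks,
# one block per slave
rows   <- 1:nrow(X)
blocks <- split(rows, cut(rows, nslaves, labels = FALSE))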

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages the R*^2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:

parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked and a bootstrap permutation is performed, an R^2 value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:

parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) is plotted against the size of the expression matrix (rows) for gene-shaving overall, bootstrapping and orthogonalization.]

Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
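
For readers who want to check these figures, Amdahl's law can be written down directly in R; the function below is just a transcription of the formula, not code from the article.

# expected speedup for parallel proportion p on n processors (Amdahl's law)
amdahl <- function(p, n) 1 / ((1 - p) + p / n)
amdahl(0.78, c(2, 10))   # 1.639... and 3.356..., i.e. the 1.63 and 3.35 quoted above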

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the overall run, the orthogonalization and the bootstrapping, under the serial, 2-node and 16-node environments.]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN.[1] Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well.[2]

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

[1] http://CRAN.R-project.org
[2] http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
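
For example, something along the following lines can be used to list a package's documentation and locate its 'doc' directory; the package name here is arbitrary, and the directory may be absent if the package ships no vignettes.

library(help = "tools")                 # overview of an installed package's documentation
system.file("doc", package = "tools")   # path to its 'doc' directory ("" if absent)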

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
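
A small illustration of the facilities just mentioned (the HTML and Compiled HTML variants only work where those help formats are installed):

?seq                           # the same as help("seq"): formatted text help
help("seq", htmlhelp = TRUE)   # HTML help for this call only
options(htmlhelp = TRUE)       # use HTML help for the rest of the session
# options(chmhelp = TRUE)      # on Windows, e.g. set in the 'Rprofile' file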

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
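
For example (the results depend on which packages are installed):

help.search("linear model")   # search name, title and keyword entries
apropos("^help")              # objects whose names match a regular expression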

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search[3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists:[4] r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list.[5] These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence these illuminate a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

[3] http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist: part of PR#2358.)
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n x m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().
• require() has a new argument named character.only to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for "terms" objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).
  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng, GPC library by Alan Murta.

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton, ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword/ for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid itself (numbered squares 1-27) is not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (56)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (75)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (48)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (69)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (33)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/




homozygote(c104t, "C")TRUE     4.46  2.3e-05
allele.count(a1691g, "G")     -0.77  0.44
c2249tT/C                      0.62  0.53
c2249tT/T                      0.59  0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176,  Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,  p-value: 0.000969

Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com

Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.

Centered variance inflation factors

A popular method in the classical linear regression model
\[ y = X_{n\times p}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), \]
to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X'X)^{-1} X'y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when

(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by
\[ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n}\,1_n 1_n', \]
where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is
\[ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. \]
It can also be written in the form
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]
where
\[ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', \]
is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.

The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
\[ \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} \]
instead of the centered R_j^2. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that x_j' C x_j ≤ x_j' x_j, so that always R_j^2 ≤ R̃_j^2. Hence the usual critical value R_j^2 > 0.9 (corresponding to centered VIF_j > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
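To make the correspondence between the two scales concrete, here is a quick arithmetic check in R (a trivial sketch, using only the relation VIF_j = 1/(1 - R_j^2) given above):

> 1/(1 - 0.9)    # a centered R_j^2 of 0.9 corresponds to a centered VIF of 10
[1] 10
> 1/(1 - 0.95)   # uncentered R_j^2 values tend to be larger, pushing the VIF up
[1] 20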

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    ucstruct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      cstruct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = cstruct,
                  UncenteredVIF = ucstruct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = ucstruct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
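As a quick illustration, the extended function can be applied to the ad-hoc example from above (a sketch; the component names CenteredVIF and UncenteredVIF are those used in the listing):

> vifs <- vif.lm(cl.lm)
> vifs$CenteredVIF     # compare against the usual critical value of 10
> vifs$UncenteredVIF   # compare against a somewhat larger critical value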

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) that you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
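As a rough orientation before the individual steps are walked through below, the manual sequence of Makefile targets described in the following sections looks like this (the package target name is taken from the later geepack example):

make down              # download cross-compilers and R sources
make xtools            # unpack the cross-building tools
make prepsrc           # unpack the R sources
make LinuxR            # only if no current Linux R is available
make mkrules           # adjust src/gnuwin32/MkRules for cross-building
make R                 # cross-compile R itself
make pkg-geepack_0.2-4 # cross-build a package source placed in pkgsrc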

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools such as curl or links/elinks can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0 the sources for pcre and bzip2 are included in the R source tar archive; before that we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
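For example, a source tarball of the form the Makefile expects could be produced with something like the following (assuming the package source lives in a directory geepack; the version number is taken from the package's DESCRIPTION file):

R CMD build geepack    # creates geepack_0.2-4.tar.gz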

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environmental variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
\[ \sum_i \frac{1}{\pi_i} Y_i, \]
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
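As a minimal numerical sketch of this estimator (the data below are invented purely for illustration):

pi <- c(0.01, 0.02, 0.01)   # inclusion probabilities for three sampled individuals
y  <- c(10, 12, 9)          # their observed values
sum(y / pi)                 # Horvitz-Thompson estimate of the population total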

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c and observations by i, then
\[ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2, \]
where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion, that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Survey (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:

imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
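A hypothetical call illustrating these two arguments might look like the following (the variable names here are invented for illustration, not taken from NHIS):

svydesign(id = ~PSU, strata = ~region + rural,
          weights = ~WT, fpc = ~Npsu, data = mysurvey)

where Npsu would hold, for each observation, the total number of PSUs in that observation's stratum.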

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)

     pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
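One way to read off such proportions directly is to turn the weighted table into row proportions (a sketch; tab is simply the svytable() result from above):

tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(prop.table(tab, margin = 1), 2)   # proportion of each age group by dose count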

To examine whether this varies by city size we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)

svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1} U_i.

4. Pass ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G. and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:
\[ \mathtt{newX}[i,] = X[i,]\;\underbrace{\left(I - \bar{x}_{S_k}\left(\bar{x}_{S_k}^{T}\bar{x}_{S_k}\right)^{-1}\bar{x}_{S_k}^{T}\right)}_{IPx} \]

6. The next optimal cluster is found by repeating the process using newX.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
\[ \mathrm{Speedup} = \frac{1}{r_s + \dfrac{r_p}{n}}, \]
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
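A small R function makes this expression easy to evaluate; with the parallel proportion of 0.78 reported later in this article, it reproduces the expected speedups quoted below (a minimal sketch):

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, c(2, 10))   # approximately 1.63 and 3.35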

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm without slave-slave communication (diagram of a master process connected to slaves 1 through n).

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.

The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2 and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data and send the buffer. Our function parbootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children.

parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master.

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results (plot of computation time in seconds against the size of the expression matrix in rows, for overall gene-shaving, bootstrapping and orthogonalization).

Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Panels: Bootstrap Comparisons, Orthogonalization Comparisons and Overall Comparisons; computation time (seconds) against size of expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = Tserial / Tparallel

S(2 nodes) = 1301.18 / 825.34 = 1.6

S(16 nodes) = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion (almost 80%) of it can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R, and different R related experiences, require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
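As a small illustration of the facilities just described (the topics searched for here are arbitrary examples, not taken from the article):

help("seq")              # the same help page as ?seq
help.search("quantile")  # fuzzy search over names, titles and keywords
apropos("med")           # objects whose names partially match "med"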

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence they illuminate a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
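The masking problem described above is easy to reproduce with a small, hypothetical example (the data frames d1 and d2 are invented here for illustration):

d1 <- data.frame(conc = 1:3)
d2 <- data.frame(conc = 4:6)
attach(d1)
attach(d2)               # d2 is attached in front of d1
conc                     # now refers to d2$conc, not d1$conc
detach(d2); detach(d1)   # clean up, or simply quit the session as suggested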

SC contains a very broad use of S-Plus commands, to the point the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
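A minimal sketch of the deparse(substitute()) idiom mentioned above (the function name myplot is invented here for illustration):

myplot <- function(x, main = deparse(substitute(x))) {
    plot(x, main = main)  # the title defaults to the expression supplied for x
}
myplot(rnorm(100))        # produces a plot titled "rnorm(100)"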

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
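For readers unfamiliar with the function, here is a hedged sketch of how such "Type II" tests are obtained with Anova(); the model formula and the Prestige data set (shipped with car) are chosen purely for illustration and are not taken from the book.

library(car)
fit <- lm(prestige ~ income + education + type, data = Prestige)
Anova(fit)                   # "Type II" tests (the default)
## Anova(fit, type = "III")  # "Type III" tests are available but discouraged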

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling 'R -g Tk' (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava: "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid, with entries numbered 1–27 as referenced in the clues below, is not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (56)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (75)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (48)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003: future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }

Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.

Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)          x2          x3          x4          x5
   2540.378    2510.568    27.92701    24.89071    4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to a centered VIF greater than 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.

Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(CenteredVIF = c.struct,
                      UncenteredVIF = uc.struct))
      } else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(list(UncenteredVIF = uc.struct))
      }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
[email protected]

Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One can find out exactly what each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools such as curl or links/elinks can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.

Compiling

Now we can cross-compile R

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
[email protected]

A.J. Rossini
University of Washington, USA
[email protected]

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g. individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i$$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
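To make the estimator concrete, here is a minimal sketch (not code from the article; the vectors y and prob are assumed to hold the sampled values and their sampling probabilities):

# Horvitz-Thompson estimate of the population total
total.ht <- sum(y / prob)
# a corresponding (ratio) estimate of the population mean
mean.ht <- sum(y / prob) / sum(1 / prob)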

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{ci} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$

where S is the weighted sum, the $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum s.
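A literal transcription of this display into R might look as follows. This sketch is not from the article; y, prob and strata are assumed variable names, and each observation is treated as its own top-level cluster (PSU), exactly as the display is written:

# variance estimate for the weighted sum S, following the display above
var.wsum <- function(y, prob, strata) {
    w <- 1/prob
    per.stratum <- tapply(seq_along(y), strata, function(idx) {
        ns <- length(idx)                     # sampled units (PSUs) in this stratum
        ybar <- weighted.mean(y[idx], w[idx]) # weighted stratum mean
        ns/(ns - 1) * sum((w[idx] * (y[idx] - ybar))^2)
    })
    sum(per.stratum)
}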

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact, there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g. strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
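For example, a call using both a multi-term strata specification and a finite population correction might look like the following sketch. The variables region, rural and FPC are hypothetical (they are not part of the NHIS data used here), and it is assumed that fpc, like the other arguments, can be given as a formula naming a column of the data frame:

dsgn <- svydesign(id = ~PSU, strata = ~region + rural,
                  weights = ~WTFA.IM, fpc = ~FPC,
                  data = immunize, nest = TRUE)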

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size, we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)

svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
    dm <- 2 * (x - mean)/(2 * sd^2)
    ds <- (x - mean)^2 * (2 * (2 * sd))/(2 * sd^2)^2 -
          sqrt(2 * pi)/(sd * sqrt(2 * pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta = I^{-1}U_i$.

4. Pass the $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
[email protected]


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N,p}$, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size $10^3$ and p usually no greater than $10^2$.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of $R^2$ are shaved. This process is repeated to produce successive reduced matrices $X^*_1, \ldots, X^*_N$. There are $[0.9N]$ genes in $X^*_1$, $[0.81N]$ in $X^*_2$, and 1 in $X^*_N$.

4. The optimal cluster size is selected from $X^*$. The selection process requires that we estimate the distribution of $R^2$ by bootstrap and calculate $G_k = R^2_k - R^{*2}_k$, where $R^{*2}_k$ is the mean $R^2$ from B bootstrap samples of $X_k$. The statistic $G_k$ is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $x_{S_k}$, from each row of X (a small R sketch of this sweep is given after this list):

$$newX_{[i,]} = X_{[i,]} \underbrace{\left(I - x_{S_k}\,(x_{S_k}^T x_{S_k})^{-1}\, x_{S_k}^T\right)}_{IP_x}$$

6. The next optimal cluster is found by repeating the process using newX.
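The sweep in step 5 is an ordinary linear projection, so it can be written in a few lines of R. The following is a minimal sketch, not code from the article; X is the expression matrix and xSk the (column vector) mean gene of the optimal cluster:

# projection matrix that removes the component along xSk
IPx <- diag(length(xSk)) - xSk %*% solve(t(xSk) %*% xSk) %*% t(xSk)
# sweep the mean gene from every row of X
newX <- X %*% IPx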


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to $x_{S_k}$. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$

where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion, and $n$ is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
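Amdahl's Law is simple to evaluate directly in R; the following small helper (not part of the original code) reproduces the estimates quoted later for a parallel proportion of 0.78:

# expected speedup for parallel proportion rp on n processors (Amdahl's Law)
speedup <- function(rp, n) 1/((1 - rp) + rp/n)
speedup(0.78, c(2, 10))   # roughly 1.6 and 3.4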

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

[Figure 2: The master-slave paradigm, without slave-slave communication. The master communicates with slaves 1 to n; the slaves do not communicate with each other.]

Bootstrapping

Bootstrapping is used to find $R^{*2}$, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
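In other words, a single serial bootstrap permutation of X amounts to something like the following one-liner (a sketch, not taken from the article's code):

# permute the elements within each row of X;
# apply() returns the permuted rows as columns, so transpose the result
Xb <- t(apply(X, 1, sample))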

The matrix X is transformed into $X^*$ B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic $R^2_b$ is derived, and the mean of $R^2_b$, $b = 1, \ldots, B$, is calculated to represent that $R^2$ which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of $IP_x$, as defined in step 5 of the gene-shaving algorithm.


The value of $IP_x$ is passed to each child. Each child then applies $IP_x$ to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child ($1, \ldots, B$). Child process i permutes elements within each row of X, calculates $R^{*2}$, and returns this to the master process. The master averages the $R^{*2}_i$, $i = 1, \ldots, B$, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
    brsq <- 0
    tasks <- length(boot.children)
    .PVM.initsend()
    # send a copy of X to everyone
    .PVM.pkdblmat(X)
    .PVM.mcast(boot.children, 0)
    # get the R squared values back and
    # take the mean
    for (i in 1:tasks) {
        .PVM.recv(boot.children[i], 0)
        rsq <- .PVM.upkdouble(1, 1)
        brsq <- brsq + (rsq/tasks)
    }
    return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated $R^2$ values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all $R^2$ values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an $R^2$ value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
    permute <- sample(1:dd[2], replace = FALSE)
    Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the $R^2$ calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the $IP_x$ matrix and the expression matrix X. The first step of this procedure is to broadcast the $IP_x$ matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
    rows <- dim(X)[1]
    tasks <- length(orth.children)
    # initialize the send buffer
    .PVM.initsend()
    # pack the IPx matrix and send it to
    # all children
    .PVM.pkdblmat(IPx)
    .PVM.mcast(orth.children, 0)
    # divide the X matrix into blocks and
    # send a block to each child
    nrows = as.integer(rows/tasks)
    for (i in 1:tasks) {
        start <- (nrows * (i - 1)) + 1
        end <- nrows * i
        if (end > rows)
            end <- rows
        Xpart <- X[start:end, ]
        .PVM.pkdblmat(Xpart)
        .PVM.send(orth.children[i], start)
    }
    # wait for the blocks to be returned
    # and re-construct the matrix
    for (i in 1:tasks) {
        .PVM.recv(orth.children[i], 0)
        rZ <- .PVM.upkdblmat()
        if (i == 1)
            Z <- rZ
        else
            Z <- rbind(Z, rZ)
    }
    dimnames(Z) <- dimnames(X)
    Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends the result back to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
    Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows) for overall gene-shaving, bootstrapping, and orthogonalization.]

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable-sized expression matrices, for the overall, orthogonalization, and bootstrapping computations, under the different environments (serial, 2 nodes, 16 nodes). Panels: Bootstrap Comparisons, Orthogonalization Comparisons, Overall Comparisons; computation time (seconds) against size of expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

$$S = \frac{T_{serial}}{T_{parallel}}, \qquad S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6, \qquad S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this Section are shippedwith R (in directory lsquodocmanualrsquo of a regular Rinstallation) Nevertheless when problems occur be-fore R is installed or the available R version is out-dated recent versions of the manuals are also avail-able from CRAN1 Let me cite the ldquoR Installation andAdministrationrdquo manual by the R Development Core

Team (R Core for short 2003b) as a reference for in-formation on installation and updates not only for Ritself but also for contributed packages

I think it is not possible to use adequately a pro-gramming language and environment like R with-out reading any introductory material Fundamen-tals for using R are described in ldquoAn Introductionto Rrdquo (Venables et al 2003) and the ldquoR Data Im-portExportrdquo manual (R Core 2003a) Both manualsare the first references for any programming tasks inR

Frequently asked questions (and correspondinganswers) are collected in the ldquoR FAQrdquo (Hornik 2003)an important resource not only for beginners It is agood idea to look into the ldquoR FAQrdquo when a questionarises that cannot be answered by reading the othermanuals cited above Specific FAQs for Macintosh(by Stefano M Iacus) and Windows (by Brian D Rip-ley) are available as well2

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
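As a minimal sketch of this (the package name "tools" is only an example, not one singled out by the article), the documentation of an installed package and the contents of its 'doc' directory can be inspected with:

    ## summarise an installed package, including its help index
    library(help = "tools")

    ## vignette or manual files, if any, live in the package's 'doc' directory
    list.files(system.file("doc", package = "tools"))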

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
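A brief sketch of these access routes (the help topic "mean" and the option settings are arbitrary examples, following the description above):

    ## two equivalent ways to open the documentation for mean()
    ?mean
    help("mean")

    ## switch to HTML help for the rest of the session, if available
    options(htmlhelp = TRUE)

    ## on Windows, Compiled HTML help can be requested instead,
    ## e.g. in the 'Rprofile' file:
    ## options(chmhelp = TRUE)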

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
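For example (the search strings below are arbitrary and assume a standard installation):

    ## search the installed documentation for a topic;
    ## matching is fuzzy by default
    help.search("linear model")

    ## find objects on the search path whose names partially match
    ## a pattern (regular expressions are allowed)
    apropos("mean")
    apropos("^qr")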

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search (http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html) provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists (accessible via http://www.R-project.org/mail.html): r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list (searchable at http://CRAN.R-project.org/search.html). These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.


Bibliography

Hornik, K. (2003). The R FAQ. Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage on variable selection in multiple linear regression entirely. The coverage on these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge on linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option defaultPackages has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required (a short example follows this list).

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
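The following brief sketch (not part of the original release notes; version strings and seed are arbitrary) illustrates the random number generator item above: RNGversion() restores the generators of an earlier release when old simulations need to be reproduced.

    ## reproduce simulations run under an earlier release, e.g. R 1.6.x
    RNGversion("1.6.2")
    set.seed(123)
    runif(3)

    ## return to the new defaults ('Mersenne-Twister' and 'Inversion')
    RNGversion("1.7.0")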

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed' by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS (a short 'Makevars' sketch follows this list).

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
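As a minimal sketch of the -lRlapack item above (the package itself is hypothetical and not part of the release notes), a package's 'src/Makevars' file would pull R's LAPACK and BLAS libraries into the link step like this:

    # src/Makevars of a hypothetical package calling LAPACK through R
    PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)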

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[Crossword grid with numbered squares 1-27 omitted in this text version; see the web page above for a printable grid.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003: future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

Page 15: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 15

else {
    warning(paste("No intercept term",
        "detected. Uncentered VIFs computed."))
    return(Uncentered.VIF = ucstruct)
}

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.

Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87, D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de

Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.


• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools such as curl or links/elinks can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).

Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.

Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, use

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.

Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.

Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i}\, Y_i$$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
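As a minimal numerical illustration of the Horvitz-Thompson estimator (the vectors p and y below are invented for this sketch and are not data from the article), the weighted total can be computed directly in R:

  ## inclusion probabilities pi_i and observed values Y_i (hypothetical)
  p <- c(0.1, 0.1, 0.5, 0.5, 0.5)
  y <- c(12, 15, 3, 4, 5)

  ## Horvitz-Thompson estimate of the population total:
  ## each observation stands in for 1/pi_i population members
  sum(y / p)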

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s-1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$

where S is the weighted sum, the $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psuid in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
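A hypothetical call using the fpc argument might look as follows (a sketch only; the data frame mysurvey and the variables WT and NPSUS are invented for illustration and are not part of the NHIS data):

  ## fpc gives the number of PSUs in the population for each stratum
  dsgn <- svydesign(id=~PSU, strata=~STRATUM, weights=~WT,
                    fpc=~NPSUS, data=mysurvey, nest=TRUE)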

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
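One quick way to check proportions like these is to rescale the weighted table computed above (a small sketch; prop.table simply divides each row by its total):

  tab <- svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
  round(prop.table(tab, margin=1), 2)   # row proportions within each age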

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
    dm <- 2*(x - mean)/(2*sd^2)
    ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1}U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G. and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001), by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene, the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X (a small sketch of this sweep is given after the list):

$$\mathrm{newX}[i,] = X[i,]\,\underbrace{\left( I - x_{S_k}\left( x_{S_k}^{T} x_{S_k} \right)^{-1} x_{S_k}^{T} \right)}_{I_{Px}}$$

6. The next optimal cluster is found by repeating the process using newX.
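As a small illustration of the sweep in step 5 (a sketch only: cluster.rows below is a hypothetical index of the genes in the optimal cluster, and xSk is their mean gene stored as a column vector):

  ## mean gene of the optimal cluster, as a p x 1 column vector
  xSk <- matrix(colMeans(X[cluster.rows, , drop=FALSE]), ncol=1)

  ## I_Px projects each row of X onto the space orthogonal to xSk
  IPx <- diag(ncol(X)) - xSk %*% solve(t(xSk) %*% xSk) %*% t(xSk)
  newX <- X %*% IPx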


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
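As a quick illustration, Amdahl's Law is easily evaluated in R; using the parallel proportion of 0.78 reported later in the article (and assuming r_s = 1 - r_p) reproduces the speedup estimates quoted there:

  ## Amdahl's Law: rp is the parallel proportion, n the number of processors
  amdahl <- function(rp, n) 1 / ((1 - rp) + rp/n)

  amdahl(0.78, 2)    # approximately 1.63
  amdahl(0.78, 10)   # approximately 3.35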

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations - quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
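A serial sketch of this step, using apply and sample as described above (Dfunct denotes the article's R^2 routine, which is defined elsewhere in the full source code, and B is the number of bootstrap samples):

  ## one bootstrap sample: permute the entries within each row of X;
  ## apply() returns its results as columns, so transpose afterwards
  perm.rows <- function(X) t(apply(X, 1, sample))

  ## mean R^2 over B bootstrap samples (serial version)
  brsq <- mean(replicate(B, Dfunct(perm.rows(X))))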

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of I_{Px}, as defined in step 5 of the gene-shaving algorithm.


The value of I_{Px} is passed to each child. Each child then applies I_{Px} to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages the R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the I_{Px} matrix and the expression matrix X. The first step of this procedure is to broadcast the I_{Px} matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.

Clearly, the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the overall algorithm, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

$$S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}$$

$$S_{\mathrm{2\,nodes}} = \frac{1301.18}{825.34} = 1.6$$

$$S_{\mathrm{16\,nodes}} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html


It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file '.Rprofile'.
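For example (a small sketch; the topic "median" is an arbitrary choice, and the chmhelp option applies to Windows only, as noted above):

  ?median                          # same as help("median")
  help("median", htmlhelp=TRUE)    # HTML version, if available
  options(chmhelp=TRUE)            # use Compiled HTML help on Windows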

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
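For instance (a sketch only; the search strings are arbitrary examples):

  help.search("regression")   # search installed documentation
  apropos("mean")             # objects whose names partially match "mean"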

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked - and answered - several times (focussing on slightly different aspects, hence these are illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data filesscripts with all the code from each chapter of thebook and a few additional (rough) chapters Thescripts pre-suppose that the data sets have alreadybeen imported into S-Plus by some mechanismndashnoexplanation of how to do this is givenndashpresentingthe novice S-Plus user with a minor obstacle Oncethe data is loaded the scripts can be used to re-dothe analyses in the book Nearly every worked ex-ample begins by attaching a data frame and thenworking with the variable names of the data frameNo detach command is present in the scripts so thework environment can become cluttered and evencause errors if multiple data frames have commoncolumn names Avoid this problem by quitting theS-Plus (or R) session after working through the scriptfor each chapter

SC contains a very broad use of S-Plus commands, to the point the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R) and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package) and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
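For readers who have not met the idiom mentioned above, a minimal sketch (the function name here is made up purely for illustration):

  myplot <- function(x, main = deparse(substitute(x)), ...) {
    plot(x, main = main, ...)   # the default title is the caller's expression
  }
  myplot(rnorm(10))             # produces a plot titled "rnorm(10)"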

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and the contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both the book and the help page.

Chapter 5 reads as a natural extension of linear models and concentrates on the error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, the detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on: software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts the text can be used directly in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.

The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to. (A short sketch of this and of the following change appears after this list.)

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
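A short sketch of the class() and random-number changes flagged above (output not shown; all functions are part of base R as of 1.7.0):

  class(1:3)            # "integer": always non-null, even without 'methods'
  oldClass(1:3)         # NULL: the bare class attribute, as in earlier versions

  RNGversion("1.6.2")   # reproduce simulations run under the old defaults
  set.seed(1)
  runif(2)
  RNGversion("1.7.0")   # back to 'Mersenne-Twister' and 'Inversion'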

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.

• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588; see the sketch at the end of this section.)

• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.

• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.

• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. It is currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.

• Formal classes and methods can be 'sealed' by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
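A minimal sketch of three of the conveniences listed above (recursive list indexing, mapply() and capture.output()); the data used here are invented for illustration:

  x <- list(1, "a", TRUE, list(letters, month.name))
  x[[c(4, 2)]]            # same as x[[4]][[2]]: the month names

  mapply(rep, 1:3, 3:1)   # multivariate lapply(): rep(1,3), rep(2,2), rep(3,1)

  out <- capture.output(summary(lm(dist ~ speed, data = cars)))
  head(out)               # the printed summary captured as a character vector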

Installation changes

• TITLE files in packages are no longer used, the 'Title' field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.

Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of the function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows some vertical spacing problems to be fixed when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch. (A short usage sketch appears at the end of this list.)

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.

session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
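To give a flavour of the genetics package's interface, here is a minimal sketch, assuming the genotype() constructor and HWE.test() function described in the package article earlier in this issue (the data are simulated and the allele names arbitrary):

  library(genetics)
  g <- genotype(c("A/A", "A/C", "C/C", "A/C", "C/C"))
  summary(g)     # allele and genotype frequencies
  HWE.test(g)    # test of Hardy-Weinberg equilibrium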

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]

Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid cannot be reproduced in this text version; see the web page above for a printable copy.]

Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac

Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest on the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

Page 16: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 16

bull pkgsrc is the location of package sources(tarred and gzipped from R CMD build) thatneed to cross-build

bull WinRlibs is the location to put the resultingWindows binaries of the packages

These directories can be changed by modifying theMakefile One may find what exactly each step doesby looking into the Makefile

Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and R source (starting from R-170 othersources that used to be necessary for cross build-ing pcre and bzip2 are no longer needed) Itplaces all sources into a separate directory calledRCrossBuilddownloads The wget program is usedto get all needed sources Other downloading toolssuch as curl or linkselinks can be used as well Weplace these sources into the RCrossBuilddownloadsdirectory The URLs of these sources are available infile srcgnuwin32INSTALL which is located withinthe R source tree (or see the Makefile associatedwith this document)

Cross-tools setup

make xtools

This rule creates a separate subdirectorycross-tools in the build area for the cross-tools It unpacks the cross-compiler tools into thecross-tools subdirectory

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR tocarry out the cross-building process of R and un-packs the R sources into it As of 170 the sourcefor pcre and bzip2 are included in the R source tararchive before that we needed to worry about them

Build Linux R if needed

make LinuxR

This step is required as of R-170 if you donrsquot havea current version of Linux R on your system and youdonrsquot have the permission to install one for systemwide use This rule will build and install a Linux Rin $RCBLinuxRR

Building R

Configuring

If a current Linux R is available from the system andon your search path then run the following com-mand

make mkrules

This rule modifies the file srcgnuwin32Mkrulesfrom the R source tree to set BUILD=CROSS andHEADER=$RCBcross-toolsmingw32include

If a current Linux R is not available from thesystem and a Linux R has just been built by makeLinuxR from the end of the last section

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCBLinuxRRbinR inaddition to the variables above

Compiling

Now we can cross-compile R

make R

The environment variable $PATH is modified inthis make target to ensure that the cross-tools areat the beginning of the search path This step aswell as initiation of the compile process is accom-plished by the R makefile target This may take awhile to finish If everything goes smoothly a com-pressed file Win-R-170tgz will be created in theRCrossBuildWinR directory

This gziprsquod tar-file contains the executables andsupporting files for R which will run on a Mi-crosoft Windows machine To install transfer andunpack it on the desired machine Since there isno InstallShield-style installer executable one willhave to manually create any desired desktop short-cuts to the executables in the bin directory such asRguiexe Remember though this isnrsquot necessarilythe same as the R for Windows version on CRAN

Building R packages and bundles

Now we have reached the point of interest As onemight recall the primary goal of this document is tobe able to build binary packages for R libraries whichcan be simply installed for R running on MicrosoftWindows All the work up to this point simply ob-tains the required working build of R for MicrosoftWindows

Now create the pkgsrc subdirectory to put thepackage sources into and the WinRlibs subdirectoryto put the windows binaries into

We will use the geepack package as an ex-ample for cross-building First put the sourcegeepack_02-4targz into the subdirectorypkgsrc and then do

R News ISSN 1609-3631

Vol 31 June 2003 17

make pkg-geepack_02-4

If there is no error the Windows binary geepackzipwill be created in the WinRlibs subdirectory whichis then ready to be shipped to a Windows machinefor installation

We can easily build bundled packages as wellFor example to build the packages in bundle VR weplace the source VR_70-11targz into the pkgsrcsubdirectory and then do

make bundle-VR_70-11

The Windows binaries of packages MASS classnnet and spatial in the VR bundle will appear inthe WinRlibs subdirectory

This Makefile assumes a tarred and gzippedsource for an R package which ends with ldquotargzrdquoThis is usually created through the R CMD buildcommand It takes the version number together withthe package name

The Makefile

The associated Makefile is used to automate manyof the steps Since many Linux distributions comewith the make utility as part of their installation ithopefully will help rather than confuse people cross-compiling R The Makefile is written in a format sim-ilar to shell commands in order to show what exactlyeach step does

The commands can also be cut-and-pasted outof the Makefile with minor modification (such aschange $$ to $ for environmental variable names)and run manually

Possible pitfalls

We have very little experience with cross-buildingpackages (for instance Matrix) that depend on ex-ternal libraries such as atlas blas lapack or Java li-braries Native Windows building or at least a sub-stantial amount of testing may be required in thesecases It is worth experimenting though

Acknowledgments

We are grateful to Ben Bolker Stephen Eglen andBrian Ripley for helpful discussions

Jun YanUniversity of WisconsinndashMadison USAjyanstatwiscedu

AJ RossiniUniversity of Washington USArossiniuwashingtonedu

Analysing Survey Data in Rby Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is $\pi_i$ and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i$$

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
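As a minimal illustration (not taken from the article; the vectors y and pi below are hypothetical), the Horvitz–Thompson total is simply the sum of the sampled values divided by their inclusion probabilities:

# Hypothetical sampled values and their inclusion probabilities
y  <- c(10, 12, 9)
pi <- c(0.1, 0.2, 0.1)
# Horvitz-Thompson estimate of the population total
sum(y / pi)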

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not, in general, maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$

where S is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
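Because the survey.design objects described below have subscripting methods, a subset analysis can be expressed by indexing the design object itself rather than the underlying data frame. This is only a minimal sketch under that assumption; dsgn and dat are hypothetical names for a design object and the data frame it was built from, and the exact indexing form is my assumption, not code from the article:

# Keep weight, cluster and stratum information matched to the selected rows
sub.dsgn <- dsgn[dat$AGEP >= 3, ]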

Example

The examples in this article use data from the National Health Interview Survey (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g. strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
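A short usage sketch of the first model in Figure 1 (the object name m is my addition, so that the summary method described above can be called on the fit):

m <- svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP),
            design = imdsgn, family = binomial)
summary(m)   # coefficients with design-based standard errors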

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by formulas.


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
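For example, rough starting values for the Figure 2 model could come from an ordinary unweighted fit. This is only a sketch of that idea; the data frame name dat is hypothetical:

fit0 <- lm(y ~ x + z, data = dat)    # ignores the survey design
start.mean <- coef(fit0)             # three values for mean = y ~ x + z
start.sd   <- summary(fit0)$sigma    # one value for sd = ~1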

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix I and score contributions $U_i$ as $\Delta_i = I^{-1} U_i$.

4. Pass $\Delta_i$ and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods such as residuals, as is shown for Pearson residuals in residuals.svyglm.
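As an illustration of step 3 above, the delta-betas can be obtained in one line once the information matrix and the per-observation score contributions are available; infmat (p x p) and U (n x p, one row per observation) are hypothetical names of my choosing, not objects from the package:

# Row i of Delta is I^{-1} U_i; infmat is symmetric, so no transpose is needed
Delta <- U %*% solve(infmat)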

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663–685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix $X_{N,p}$, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size $10^3$ and p usually no greater than $10^2$.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of $R^2$ are shaved. This process is repeated to produce successive reduced matrices $X^*_1, \ldots, X^*_N$. There are $[0.9N]$ genes in $X^*_1$, $[0.81N]$ in $X^*_2$, and 1 in $X^*_N$.

4. The optimal cluster size is selected from $X^*$. The selection process requires that we estimate the distribution of $R^2$ by bootstrap and calculate $G_k = R^2_k - R^{*2}_k$, where $R^{*2}_k$ is the mean $R^2$ from B bootstrap samples of $X_k$. The statistic $G_k$ is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, $\bar{x}_{S_k}$, from each row of X:

$$newX[i,] = X[i,]\;\underbrace{\left( I - \bar{x}_{S_k} \left( \bar{x}_{S_k}^T \bar{x}_{S_k} \right)^{-1} \bar{x}_{S_k}^T \right)}_{IP_x}$$

6. The next optimal cluster is found by repeating the process using newX.


[Figure 1: Clustering of genes.]

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to $\bar{x}_{S_k}$.

In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$

where $r_s$ is the proportion of the program computed in serial, $r_p$ is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

[Figure 2: The master-slave paradigm without slave-slave communication; the master connects directly to each of Slave 1 through Slave n.]

Bootstrapping

Bootstrapping is used to find $R^{*2}$, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
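In serial code that row-wise permutation might look like the following one-liner. This is only a sketch, not code from the article; note that apply() returns its per-row results as columns, hence the transpose:

# Permute within each row of X to obtain one bootstrap matrix
Xstar <- t(apply(X, 1, sample))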

The matrix X is transformed into $X^*$ B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic $R^2_b$ is derived, and the mean of $R^2_b$, $b = 1, \ldots, B$, is calculated to represent that $R^2$ which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of $IP_x$, as defined in step 5 of the gene-shaving algorithm.


The value of $IP_x$ is passed to each child. Each child then applies $IP_x$ to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child ($1, \ldots, B$). Child process i permutes elements within each row of X, calculates $R^{*2}$, and returns this to the master process. The master averages $R^{*2}_i$, $i = 1, \ldots, B$, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children.

parbootstrap <- function(X){
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated $R^2$ values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all $R^2$ values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, an $R^2$ value is calculated and returned to the master. The following code performs these operations of the children tasks.

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the $R^2$ calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the $IP_x$ matrix and the expression matrix X. The first step of this procedure is to broadcast the $IP_x$ matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:

parorth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.]

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
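These expected speedups are easy to reproduce with a couple of lines of R; this is only a sketch of the calculation, not code from the article:

# Amdahl's law: rp is the parallel proportion, n the number of processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp/n)
amdahl(0.78, 2)    # approximately 1.63
amdahl(0.78, 10)   # approximately 3.35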

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable-sized expression matrices for the overall, orthogonalization and bootstrapping computations under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

$$S = \frac{T_{serial}}{T_{parallel}}$$

$$S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6$$

$$S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4$$

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
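Two short calls illustrate these functions (the search terms are arbitrary examples of mine):

help.search("regression")  # search names, titles and keywords of installed help pages
apropos("plot")            # objects on the search path with "plot" in their name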

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism; no explanation of how to do this is given, presenting the novice S-Plus user with a minor obstacle. Once the data are loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The numbered crossword grid (squares 1-27) is not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables.

Editor Help Desk: Uwe Ligges.

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/


make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.

The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.

Possible pitfalls

We have very little experience with cross-building packages (for instance Matrix) that depend on external libraries such as atlas, blas, lapack or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu

Analysing Survey Data in R

by Thomas Lumley

Introduction

Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability the answer would be very straightforward. They aren't, and it isn't.

To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

\[ \sum_i \frac{1}{\pi_i} Y_i \]

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
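
As a minimal illustration in base R (the values and inclusion probabilities below are made up for this sketch, not taken from any survey), the estimator is just a weighted sum:

## Hypothetical toy data: observed values and inclusion probabilities pi_i
y <- c(12, 7, 30, 5)
p <- c(0.10, 0.50, 0.20, 0.25)

## Horvitz-Thompson estimate of the population total
sum(y / p)

## The same weights estimate a population mean, dividing by the
## estimated population size sum(1/pi)
sum(y / p) / sum(1 / p)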

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c and observations by i, then

\[ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 \]

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
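
As a rough illustration only (this is not the code of svyCprod, which handles many further details), the displayed formula can be transcribed literally into R; the function and argument names below are ours:

## Literal transcription of the variance formula above.
## y: observations, pi: sampling probabilities,
## strata and cluster: vectors identifying the design structure.
svy.total.var <- function(y, pi, strata, cluster) {
  one.stratum <- function(idx) {
    ns   <- length(unique(cluster[idx]))         # number of clusters in this stratum
    ybar <- weighted.mean(y[idx], 1 / pi[idx])   # weighted stratum mean
    ns / (ns - 1) * sum((1 / pi[idx] * (y[idx] - ybar))^2)
  }
  sum(sapply(split(seq(along = y), strata), one.stratum))
}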

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is


imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psu id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
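
For instance, such a design might be declared along the following lines; this is a sketch only, in which the column name FPC is hypothetical and the assumption that fpc accepts a formula like the other arguments is ours:

imdsgn2 <- svydesign(id=~PSU, strata=~STRATUM,
                     weights=~WTFA.IM, data=immunize,
                     nest=TRUE, fpc=~FPC)   # FPC: stratum population sizes (hypothetical column)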

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.
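
Those percentages can be read off more directly by converting the weighted counts to row proportions; a possible follow-up, assuming the object returned by svytable() can be treated as an ordinary table, is:

tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(prop.table(tab, margin = 1), 2)   # share of each dose count within every age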

To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2, cities from 2.5 million to 5 million people, does show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
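
For concreteness, the first model of Figure 1 would be fitted and inspected along the following lines (output not shown):

fit <- svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP),
              design = imdsgn, family = binomial)
summary(fit)   # coefficient table with design-based standard errors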

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
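
One simple way to obtain such starting values, under the assumption that the variables y, x and z live in a data frame called df (a hypothetical name), is to fit the model once while ignoring the design:

fit0  <- lm(y ~ x + z, data = df)            # unweighted fit, survey design ignored
start <- list(coef(fit0), sd(resid(fit0)))   # starting values for mean coefficients and sd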

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
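
A schematic of steps 5 to 7 above for a hypothetical class "svywhatever" might look like the following; the class name and the idea that the design is stored in a $design component come from the steps themselves, while everything else is illustrative rather than the survey package's actual code:

print.svywhatever <- function(x, ...) {
  print(x$design, varnames = FALSE, design.summaries = FALSE)  # step 6: describe the design
  NextMethod()                     # then fall back to the underlying model's print method
}

logLik.svywhatever <- function(object, ...)
  stop("log-likelihoods are not defined for survey-weighted fits")   # step 7: fail loudly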

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001), by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, ..., and 1 in X*_N.

4 The optimal cluster size is selected from XlowastThe selection process requires that we estimatethe distribution of R2 by bootstrap and calcu-late Gk = R2

k minus Rlowast2k where Rlowast2

k is the mean R2

from B bootstrap samples of Xk The statisticGk is termed the gap statistic and the optimalcluster is that which has the largest gap statis-tic

5 Once a cluster has been determined the pro-cess repeats searching for more clusters afterremoving the information of previously deter-mined clusters This is done by ldquosweeping themean gene of the optimal cluster xSk

from eachrow of X

newX[i] = X[i] (I minus xSk(xT

SkxSk

)minus1 xTSk

)︸ ︷︷ ︸IPx

6 The next optimal cluster is found by repeatingthe process using newX
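As a rough illustration of step 4 (the names below are ours, not taken from the published gene-shaving code), the gap statistic simply compares each nested cluster's R^2 on the observed data with its average R^2 over the bootstrap samples:

# rsq:      R^2 of each nested cluster on the real data (length K)
# boot.rsq: B x K matrix of R^2 values from B bootstrap (permuted) data sets
gap.statistic <- function(rsq, boot.rsq) {
  rsq - colMeans(boot.rsq)
}
# the optimal cluster size is the one with the largest gap
best.cluster <- function(rsq, boot.rsq) which.max(gap.statistic(rsq, boot.rsq))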


Figure 1: Clustering of genes.

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

   Speedup = 1 / (r_s + r_p / n),

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
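As a small illustration (our code, not part of the article's implementation), Amdahl's law is easy to evaluate directly in R; with roughly 78% of the work parallelizable, as estimated later in this article, the attainable speedup flattens out quickly as processors are added:

# expected speedup for a parallel proportion rp run on n processors
amdahl <- function(rp, n) {
  1 / ((1 - rp) + rp / n)
}
amdahl(0.78, c(2, 10, 16))   # roughly 1.6, 3.4 and 3.7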

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.

Figure 2: The master-slave paradigm without slave-slave communication (one master process connected to slaves 1 to n).

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
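A minimal serial sketch of this row-wise permutation (our code, not taken from the S-PLUS implementation) is:

# permute the entries within each row of X; apply() returns the permuted
# rows as columns, so transpose to recover the original shape
boot.sample <- function(X) {
  t(apply(X, 1, sample))
}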

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of the R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
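For concreteness, the projection matrix IPx of step 5 can be formed in a couple of lines; this is a sketch under our own naming (x.Sk stands for the mean gene of the optimal cluster, a vector of length p):

make.IPx <- function(x.Sk) {
  x <- matrix(x.Sk, ncol = 1)   # p x 1 column vector
  diag(length(x.Sk)) - x %*% solve(t(x) %*% x) %*% t(x)
}
# each row of X is later multiplied by this p x p matrix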


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
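A simple way of forming the contiguous blocks described above (a hypothetical helper, not part of the article's code) is:

# split the row indices 1..nrows into 'tasks' contiguous blocks,
# one block per slave
row.blocks <- function(nrows, tasks) {
  block <- rep(1:tasks, each = ceiling(nrows / tasks), length.out = nrows)
  split(1:nrows, block)
}
row.blocks(10, 3)   # rows 1-4, 5-8 and 9-10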

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2 and returns this to the master process. The master averages the R*^2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, an R^2 value is calculated and returned to the master. The following code performs these operations in the child tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
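The article does not list the simulation code; a test matrix of the kind described can be generated along these lines (a sketch, assuming independent standard normal noise is an adequate stand-in for expression values):

N <- 1600   # also 3200, 6400 and 12800 in the timings
p <- 80     # constant number of samples
X <- matrix(rnorm(N * p), nrow = N, ncol = p)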

Figure 3: Serial gene-shaving results. Computation time (seconds) against the size of the expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against the size of the expression matrix (rows) for the overall algorithm, the orthogonalization and the bootstrapping, under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

   S = T_serial / T_parallel

   S_2nodes  = 1301.18 / 825.34 = 1.6
   S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A. and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D. and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L. and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison and Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help - R's Help Facilities and Manuals

by Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN1. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well2.

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference.

1 http://CRAN.R-project.org
2 http://CRAN.R-project.org/faqs.html


It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

Occasionally library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
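A few illustrative calls (our examples, with arbitrary search terms):

?lm                           # same as help("lm")
help("lm", htmlhelp = TRUE)   # HTML version, with links to related topics
help.search("linear model")   # search names, titles and keywords
apropos("^lm")                # objects whose names match a regular expression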

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists4: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list5. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions already have been asked - and answered - several times (focussing on slightly different aspects, hence they illuminate a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

3 http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 the archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications


are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g. some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g. the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons,


Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g. color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models, the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g. on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error: this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.
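  A minimal sketch of the recommended replacement (the error below is contrived for illustration):

    res <- try(sqrt("a"))         # the error is caught instead of aborting
    inherits(res, "try-error")    # TRUE, so a script can react and continue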

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng, GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid not reproduced here.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

Analysing Survey Data in R

by Thomas Lumley

structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

\[ \sum_i \frac{1}{\pi_i} Y_i \]

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
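As a toy illustration of the estimator above (the probabilities and values are made up, not taken from NHIS), the weighted total is simply:

  pi <- c(0.1, 0.1, 0.5, 0.5)   # sampling probabilities pi_i
  Y  <- c(12, 15, 7, 9)         # observed values Y_i
  sum(Y / pi)                   # Horvitz-Thompson estimate of the population total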

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.

Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by s, clusters by c, and observations by i, then

\[ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 \]

where S is the weighted sum, the \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
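A rough sketch of this kind of stratified between-cluster variance computation, written directly in R for weighted cluster totals, might look as follows. This is purely illustrative and is not the survey package's internal code; svyCprod performs the actual computation, including finite population corrections:

  strat.var <- function(u, strata) {
    # u: matrix of weighted cluster totals (one row per cluster)
    # strata: stratum identifier for each cluster
    v <- 0
    for (s in unique(strata)) {
      us <- u[strata == s, , drop = FALSE]
      ns <- nrow(us)
      centred <- sweep(us, 2, colMeans(us))   # subtract stratum means
      v <- v + ns / (ns - 1) * crossprod(centred)
    }
    v
  }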

Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
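For example, assuming the subsetting methods for survey.design objects behave like base R's subset(), restricting an analysis to one subgroup might look like this (illustrative only; imdsgn is the design object defined in the next section):

  sub.dsgn <- subset(imdsgn, AGEP >= 3)        # children aged 3 and over
  svytable(~pmin(3, POLCTC), design = sub.dsgn)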

Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.

The survey package

In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is

R News ISSN 1609-3631

Vol 31 June 2003 19

imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.

To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
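For instance, plausible starting values for the Figure 2 fit could be taken from an ordinary unweighted regression. This sketch assumes the response and covariates live in a data frame mydata (a name made up here):

  fit0  <- lm(y ~ x + z, data = mydata)              # ignores the survey design
  start <- list(coef(fit0), summary(fit0)$sigma)     # mean parameters, then sd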

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1} U_i.

4. Pass ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663–685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k − R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \bar{x}_{S_k}, from each row of X:

   \[ \mathrm{newX}[i,] = X[i,]\,\underbrace{\left( I - \bar{x}_{S_k}\left(\bar{x}_{S_k}^T \bar{x}_{S_k}\right)^{-1} \bar{x}_{S_k}^T \right)}_{IPx} \]

6. The next optimal cluster is found by repeating the process using newX.


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel. Firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \bar{x}_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

\[ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \]

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
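The formula translates directly into a small R function; the figures below match the estimates quoted later for a parallel proportion of 0.78:

  amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
  amdahl(0.78, 2)    # approximately 1.63
  amdahl(0.78, 10)   # approximately 3.35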

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm without slave-slave communication (a master node connected to slaves 1, 2, 3, ..., n)

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
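A one-line sketch of that serial step (not the authors' exact code), permuting within each row of X with apply() and sample():

  Xb <- t(apply(X, 1, sample))   # each row of X permuted independently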

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
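For reference, the IPx matrix from step 5 can be computed in a couple of lines; here x.Sk is assumed to hold the mean gene of the optimal cluster as a numeric vector (the name is ours, not from the paper's code):

  x   <- matrix(x.Sk, ncol = 1)                        # mean gene as a column vector
  IPx <- diag(nrow(x)) - x %*% solve(t(x) %*% x) %*% t(x)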


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i==1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the overall algorithm, orthogonalization, and bootstrapping, under the serial, 2-node and 16-node environments.

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

\[ S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}, \qquad S_{\mathrm{2\,nodes}} = \frac{1301.18}{825.34} = 1.6, \qquad S_{\mathrm{16\,nodes}} = \frac{1301.18}{549.59} = 2.4 \]

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

by Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore a description of the available manuals, help functions within R, and mailing lists will be given, as well as instructions on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html

Occasionally library(help=PackageName) indi-cates that package specific manuals (or papers corre-sponding to packages) so-called vignettes are avail-able in directory lsquolibraryPackageNamedocrsquo

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the “optimal” choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The “?” is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
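For example, a short session using the calls just mentioned (the 'htmlhelp' option applies to R versions of that era; newer versions provide other mechanisms):

?median                    # same as help("median"); shown as formatted text
options(htmlhelp = TRUE)   # prefer HTML help, where available, for this session
help("median")             # now rendered as HTML with links to related topics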

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
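For example:

help.search("robust regression")   # search names, titles and keywords of installed packages
apropos("var")                     # objects on the search path whose names contain "var"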

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other resources, and a page containing a search engine and keywords. On the latter page, the search engine can be used to “search for keywords, function and data names and text in help page titles” of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
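This browser-based interface is started with a single call:

help.start()   # opens the HTML index, manuals, FAQs and the search page in a browser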

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the “R FAQ”) or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: “Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest” (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, “The computing is presented in S-Plus, but all the examples will also work in the freeware program called R.” Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is “nonstatistician scientists in various fields and students of statistics”. It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the “intended audience” population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using “Type II” tests as implemented by the Anova() function in package car. The infamous “Type III” tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not “regression-only”.

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough “reminders” about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from its target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option 'defaultPackages' has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
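A brief sketch of how this can be used (the version string is only an example):

RNGversion("1.6.2")                        # request the generators R 1.6.2 used by default
set.seed(123); runif(2)                    # draws reproduce those of the older version
RNGversion(as.character(getRversion()))    # return to the current defaults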

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.
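The warning usually indicates that any() or all() was intended, as in this small sketch:

x <- c(TRUE, FALSE, TRUE)
if (any(x)) "at least one TRUE"            # reduce the vector to a single condition
if (all(x)) "all TRUE" else "not all TRUE"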

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
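For example:

x <- list(a = 1, b = list(c = 2, d = 3))
x[[c(2, 1)]]   # same as x[[2]][[1]], i.e. the value 2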

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
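For example:

out <- capture.output(summary(lm(dist ~ speed, data = cars)))
head(out, 3)   # first lines of the printed summary, now a character vector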

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
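A quick illustration (the method and package names are only examples; the relevant objects lived in different packages in R 1.7.0 than they do in current R):

getS3method("print", "lm")              # look up a registered S3 method that is not exported
getFromNamespace("print.lm", "stats")   # fetch an object directly from a namespace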

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
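For example:

mapply(function(x, y) x + y, 1:3, c(10, 20, 30))   # returns 11, 22, 33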

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().
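For example:

res <- try(sqrt("a"), silent = TRUE)   # the error is caught and no message is printed
inherits(res, "try-error")             # TRUE; res holds the error message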

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and the associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J. H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including a Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book “Visualizing Categorical Data” by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid cannot be reproduced in this text version; see the URL above for the grid.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (1,5)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: “At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting.”

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

Vol 31 June 2003 19

imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psuid in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g. strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
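
As an illustration only (this call is not from the NHIS example; the variables psuid, region, rural, wt, stratsize and the data frame mysurvey are all hypothetical), such a design might be declared as:

# Hypothetical design: strata formed by crossing region and rural;
# stratsize holds the population number of PSUs in each stratum and
# is passed as the finite population correction.
dsgn <- svydesign(id=~psuid, strata=~region+rural,
                  weights=~wt, fpc=~stratsize,
                  data=mysurvey)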

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.

Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
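
To read those percentages off directly, the weighted table can be converted to row proportions. This is a sketch rather than code from the article, and it assumes the svytable() result can be handled like an ordinary table:

# Row proportions of the weighted table: each row (age in years) sums to 1.
round(prop.table(svytable(~AGEP+pmin(3,POLCTC), design=imdsgn), margin=1), 2)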

To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.

Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2, cities from 2.5 million to 5 million people, does show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.

General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by


svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
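
For instance (a sketch, not from the article), rough starting values for the Figure 2 model could be taken from an unweighted fit, assuming the variables y, x and z live in a data frame dat:

# Unweighted fit ignoring the design; its estimates serve as starting values.
fit0  <- lm(y~x+z, data=dat)
start <- list(coef(fit0), sd(resid(fit0)))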

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1} U_i.

4. Pass ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a "svywhatever" class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X (N x p), which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved (a small serial sketch of this step is given after the list). This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R^{*2}_k, where R^{*2}_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X:

   newX[i,] = X[i,] (I - x_{S_k} (x_{S_k}^T x_{S_k})^{-1} x_{S_k}^T)

   where the matrix in parentheses, an orthogonal projection, is denoted IPx.

6. The next optimal cluster is found by repeating the process using newX.
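
To make steps 2 and 3 concrete, here is a minimal serial sketch; it is not the authors' S-PLUS or R code, and it omits the mean-centring details of the published algorithm:

# One shaving step: take the leading right singular vector of X as the
# "eigen-gene", compute each gene's squared correlation with it, and drop
# the fraction of genes (rows) with the lowest R^2.
shave.once <- function(X, frac=0.1){
  eigen.gene <- svd(X)$v[,1]
  r2 <- apply(X, 1, function(g) cor(g, eigen.gene)^2)
  X[r2 > quantile(r2, frac), , drop=FALSE]
}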


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

   Speedup = 1 / (r_s + r_p/n)

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
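
As a small worked check of this formula (not code from the article), the speedups quoted later for a parallel proportion of roughly 0.78 can be reproduced directly:

# Amdahl's-law speedup for serial proportion rs on n processors.
amdahl <- function(rs, n) 1/(rs + (1 - rs)/n)
amdahl(0.22, 2)   # roughly 1.6 with two nodes
amdahl(0.22, 10)  # roughly 3.4 with ten nodes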

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm without slave-slave communication (a master process connected to slaves 1 to n)

Bootstrapping

Bootstrapping is used to find R^{*2}, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
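
A serial version of one such permutation (an illustrative one-liner, not taken from the article) could be:

# Permute the entries within each row of X; apply() returns the permuted
# rows as columns, so transpose to recover the original orientation.
Xb <- t(apply(X, 1, sample))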

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R^{*2}_i and returns this to the master process. The master averages R^{*2}_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i==1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows) for overall gene-shaving, bootstrapping and orthogonalization

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable sized expression matrices for bootstrapping, orthogonalization and the overall computation under the different environments (serial, 2 nodes, 16 nodes); each panel plots computation time (seconds) against size of expression matrix (rows)

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

   S = T_serial / T_parallel

   S_2nodes  = 1301.18 / 825.34 = 1.6

   S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

¹ http://CRAN.R-project.org/
² http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way for getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
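
For illustration (these calls are not from the article; they simply exercise the functions just described):

help("mean")                 # the same as ?mean: formatted text help
options(htmlhelp=TRUE)       # prefer HTML help pages for the rest of the session
help.search("linear model")  # search names, titles and keywords of installed packages
apropos("mean")              # objects whose names partially match "mean"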

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked - and answered - several times (focussing on slightly different aspects, hence these are illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

³ http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html, ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R) and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage on these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

bull Generic function confint() and lsquolmrsquo method(formerly in package MASS which has lsquoglmrsquoand lsquonlsrsquo methods)

bull New function constrOptim() for optimisationunder linear inequality constraints

bull Add lsquodifftimersquo subscript method and meth-ods for the group generics (Thereby fixingPR2345)

bull downloadfile() can now use HTTP proxieswhich require lsquobasicrsquo usernamepassword au-thentication

bull dump() has a new argument lsquoenvirrsquo The searchfor named objects now starts by default in theenvironment from which dump() is called

bull The editmatrix() and editdataframe()editors can now handle logical data

bull New argument lsquolocalrsquo for example() (sug-gested by Andy Liaw)

bull New function filesymlink() to create sym-bolic file links where supported by the OS

bull New generic function flush() with a methodto flush connections

bull New function force() to force evaluation of aformal argument

bull New functions getFromNamespace()fixInNamespace() and getS3method() to fa-cilitate developing code in packages withnamespaces

bull glm() now accepts lsquoetastartrsquo and lsquomustartrsquo asalternative ways to express starting values

bull New function gzcon() which wraps a connec-tion and provides (de)compression compatiblewith gzip

load() now uses gzcon() so can read com-pressed saves from suitable connections

bull helpsearch() can now reliably match indi-vidual aliases and keywords provided that allpackages searched were installed using R 170or newer

bull histdefault() now returns the nominalbreak points not those adjusted for numericaltolerances

To guard against unthinking use lsquoin-cludelowestrsquo in histdefault() is now ig-nored with a warning unless lsquobreaksrsquo is a vec-tor (It either generated an error or had no ef-fect depending how prettification of the rangeoperated)

bull New generic functions influence()hatvalues() and dfbeta() with lm and glmmethods the previously normal functionsrstudent() rstandard() cooksdistance()and dfbetas() became generic These havechanged behavior for glm objects ndash all originat-ing from John Foxrsquo car package


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as NA, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling 'R -g Tk' (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles: the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium, ... By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng, GPC library by Alan Murta

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton, ported to R by F. Fivaz

gstat variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R-news. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid is not reproduced in this text version.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing, communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


svyglm(I(POLCTC>=3) ~ factor(AGEP) + factor(MSASIZEP),
       design = imdsgn, family = binomial)

svyglm(I(POLCTC>=3) ~ factor(AGEP) + MSASIZEP,
       design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status

gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation

formulas: Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.

Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I^{-1} U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
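To make the recipe concrete, here is a minimal sketch of steps 1-5 for a hypothetical Cox model wrapper. The design-object component names used below (prob, variables, strata, cluster, fpc) and the svyCprod() argument order are assumptions made for this illustration, not a description of the package's internals.

# Illustrative sketch only: component names and the svyCprod() argument
# order are assumed for this example, not taken from the survey package.
library(survival)
svycoxph.sketch <- function(formula, design, ...) {
  pwt <- 1 / design$prob                       # sampling weights 1/pi_i
  fit <- coxph(formula, data = design$variables,
               weights = pwt, ...)             # steps 1-2: weighted call, evaluated
  db  <- resid(fit, type = "dfbeta")           # step 3: one-step delta-betas
  fit$var <- svyCprod(db, design$strata,       # step 4: design-based variance
                      design$cluster, design$fpc)
  fit$design <- design                         # step 5: keep a copy of the design
  class(fit) <- c("svycoxph.sketch", class(fit))
  fit
}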

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.

Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu


Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R-news, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N x p}, which represents expressions of genes in the micro array. Each row of X is data from a gene, the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, ..., and 1 in X*_N.

4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X:

   newX[i,] = X[i,] (I - x_{S_k} (x_{S_k}^T x_{S_k})^{-1} x_{S_k}^T),

   where the bracketed projection term is referred to as IPx below (a short R sketch of this step is given after this list).

6. The next optimal cluster is found by repeating the process using newX.
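The sweeping step in 5 is compact to express in R. The following lines are our own illustrative sketch in the notation of the algorithm; the object cluster is assumed to hold the row indices of the optimal cluster S_k:

# Sketch of step 5: project the mean gene of cluster S_k out of every row of X.
xSk  <- colMeans(X[cluster, , drop = FALSE])          # mean gene of S_k (length p)
IPx  <- diag(ncol(X)) - outer(xSk, xSk) / sum(xSk^2)  # I - xSk (xSk' xSk)^-1 xSk'
newX <- X %*% IPx                                     # each row now orthogonal to xSk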


Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = 1 / (r_s + r_p/n),

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
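As a quick illustration (a small helper of our own, not part of the paper's code), the expected speedups quoted later for a parallel proportion of about 0.78 follow directly from this formula:

amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)  # Amdahl's law, with r_s = 1 - r_p
amdahl(0.78, c(2, 10))                             # approximately 1.64 and 3.36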

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm, without slave-slave communication

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
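For reference, the serial within-row permutation described above amounts to a single line of R (a sketch using the notation of this article; the transpose restores the gene-per-row orientation, since apply() returns its results column-wise):

Xb <- t(apply(X, 1, sample))   # permute the entries of each row of X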

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
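For orientation, spawning the two groups of slave tasks might look roughly like the lines below once the PVM daemon is running. This is only a sketch in the style of Li and Rossini (2001); the slave script file names are placeholders rather than the authors' actual files, while the task counts (ten and sixteen) follow the description later in this article.

library(rpvm)
# hypothetical slave scripts: ten bootstrap tasks, sixteen orthogonalization tasks
boot.children <- .PVM.spawnR(ntask = 10, slave = "bootstrap_slave.R")
orth.children <- .PVM.spawnR(ntask = 16, slave = "orth_slave.R")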

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time used by the bootstrapping and orthogonalization is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results: computation time (seconds) against the size of the expression matrix (rows) for overall gene-shaving, bootstrapping and orthogonalization.

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as


we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable-sized expression matrices, for the overall computation, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes  = 1301.18 / 825.34 = 1.6
    S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson Robert Murison Ian A MasonThe University of New England Australiabcarsonturinguneeduaurmurisonturinguneeduauiamturinguneeduau

R Help Desk

Getting Help - R's Help Facilities and Manuals

by Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use a programming language and environment like R adequately without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
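For instance (MASS is just an arbitrary installed package used for illustration):

    library(help = MASS)   # lists the package's title, index and documentation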

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
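As a concrete illustration of the calls just described (var() is only an arbitrary example topic):

    ?var                           # the "?" shortcut for help("var")
    help("var")                    # the same, written as a function call
    help("var", htmlhelp = TRUE)   # HTML help for this call only
    options(htmlhelp = TRUE)       # HTML help for the rest of the session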

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice -- it also accepts regular expressions.
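For example (the search strings here are arbitrary):

    help.search("linear model")   # search names, titles and keyword entries
    apropos("lm")                 # objects with "lm" somewhere in their name
    apropos("^lm")                # a regular expression: names starting with "lm"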

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

¹ http://CRAN.R-project.org/
² http://CRAN.R-project.org/faqs.html
³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ Accessible via http://www.R-project.org/mail.html
⁵ The archives are searchable; see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing. An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
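The following toy example (the data frames are made up, not taken from the book's scripts) shows the kind of masking the reviewer has in mind:

    df1 <- data.frame(x = 1:3)
    df2 <- data.frame(x = 4:6)
    attach(df1)
    attach(df2)        # 'x' from df1 is now masked by 'x' from df2
    x                  # 4 5 6 -- possibly not the column intended
    detach(df2)
    detach(df1)        # detach explicitly to keep the search path clean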

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
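The idiom mentioned in passing above can be illustrated with a small made-up function (not taken from the book): the unevaluated argument is turned into a character string and used as the default plot title.

    myplot <- function(x, main = deparse(substitute(x))) {
      plot(x, main = main)
    }
    myplot(rnorm(100))   # the title reads "rnorm(100)"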

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
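As an illustration of the kind of call discussed here (the model is made up for illustration and not taken from the book; Prestige is one of the data sets shipped with car):

    library(car)
    fit <- lm(prestige ~ income + education + type, data = Prestige)
    Anova(fit)    # "Type II" tests, the default of car's Anova()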

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See the Startup and options help pages for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library.

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling 'R -g Tk' (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind, and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP: see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[Crossword grid with clue numbers 1-27 omitted.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).

Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N x p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of the order 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9 N] genes in X*_1, [0.81 N] in X*_2, ..., and 1 in X*_N.

4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_Sk, from each row of X:

       newX[i,] = X[i,] (I - x_Sk (x_Sk^T x_Sk)^{-1} x_Sk^T)

   where the term in parentheses is denoted IPx.

6. The next optimal cluster is found by repeating the process using newX.
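To make the shaving step concrete, the following is a minimal serial sketch of steps 2-3. It is not the authors' code: the helper name shave.once is ours, and using svd() for the leading principal component (with the rows of X assumed to be centred already) is an illustrative choice.

shave.once <- function(X) {
  # leading principal component of the genes, the "eigen-gene"
  eigen.gene <- svd(X)$v[, 1]
  # squared correlation of each gene (row) with the eigen-gene
  rsq <- apply(X, 1, function(gene) cor(gene, eigen.gene)^2)
  # shave the rows yielding the lowest 10% of R^2
  X[rsq > quantile(rsq, 0.10), , drop = FALSE]
}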



Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = 1 / (r_s + r_p/n)

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
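As a small worked illustration (not part of the original article), Amdahl's law translates directly into an R function; the name amdahl is ours.

amdahl <- function(rp, n) {
  # rp: parallel proportion, n: number of processors
  1 / ((1 - rp) + rp / n)
}
amdahl(0.78, c(2, 10))  # approximately 1.6 and 3.4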

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children – each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

Figure 2: The master-slave paradigm, without slave-slave communication.

Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
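For concreteness, the serial permutation described here could be written as the following one-liner; this is a sketch, not code taken from the serial implementation.

# permute within each row of X; t() is needed because apply()
# returns the permuted rows as columns
Xb <- t(apply(X, 1, sample))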

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
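A plausible way for the master to form IPx from the mean gene of the optimal cluster (here a length-p vector x.Sk; both names are ours, not taken from the article) is the direct translation of step 5:

x.Sk <- matrix(x.Sk, ncol = 1)   # mean gene of the optimal cluster, p x 1
IPx <- diag(nrow(x.Sk)) -
  x.Sk %*% solve(crossprod(x.Sk)) %*% t(x.Sk)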



The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
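For readers who have not seen that article, the surrounding setup looks roughly like the sketch below; the slave script names are placeholders, and the exact arguments of .PVM.spawnR() should be checked against the rpvm documentation.

library(rpvm)
.PVM.start.pvmd()   # start the PVM daemon if it is not already running
# B bootstrap tasks and 16 orthogonalization tasks; the script names
# are placeholders, not the authors' file names
boot.children <- .PVM.spawnR(ntask = B, slave = "bootstrap_slave.R")
orth.children <- .PVM.spawnR(ntask = 16, slave = "orth_slave.R")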

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B) and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2 and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations of the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions



back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

Figure 3: Serial gene-shaving results (computation time in seconds against size of expression matrix in rows, for gene-shaving overall, bootstrapping and orthogonalization).

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as



we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable sized expression matrices, for the overall algorithm, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes); computation time in seconds against size of expression matrix in rows.

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes = 1301.18 / 825.34 = 1.6

    S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way for getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
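For instance, the facilities described above are invoked along these lines (a sketch; the htmlhelp argument and help.start() behave as described only where the corresponding help formats are installed):

?help                        # same as help("help")
help("lm", htmlhelp = TRUE)  # HTML help for a single call, if available
help.search("linear model")  # search name, title and keyword entries
apropos("^glm")              # objects whose names match a regular expression
help.start()                 # browser-based index, manuals and search engine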

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked – and answered – several times (focussing on slightly different aspects, hence these are illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html



Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R) and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage on these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge on linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.



The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

  class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.



• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e. it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.



• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g. on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library.

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local regression, likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R-news. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid with clue positions 1 to 27 not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (33)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


[Figure 1: Clustering of genes]

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.

Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_Sk.

In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = 1 / (r_s + r_p/n)

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.

The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children – each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.

[Figure 2: The master-slave paradigm without slave-slave communication; a master process connected to slaves 1 to n.]

Bootstrapping

Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R*²_b is derived, and the mean of R*²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
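For reference, the serial within-row permutation described above can be sketched in a single line of R; this is only an illustration (X denotes the expression matrix), not code from the article:

## Shuffle the entries within each row of X to obtain one
## bootstrap matrix; apply() returns its results column-wise,
## hence the transpose.
Xb <- t(apply(X, 1, sample))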

Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.


The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B) and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*² values, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results; computation time (seconds) against size of expression matrix (rows) for overall gene-shaving, bootstrapping and orthogonalization.]

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
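These estimates are easy to check in R; the helper function below is not part of the article's code, merely a sketch of Amdahl's formula evaluated at a parallel proportion of 0.78:

## Amdahl's law: expected speedup for parallel proportion rp
## of the work spread over n processors.
speedup <- function(rp, n) 1 / ((1 - rp) + rp / n)
round(speedup(0.78, c(2, 10)), 2)
## roughly 1.64 and 3.36, matching the 1.63 and 3.35 quoted
## above up to rounding of the parallel proportion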

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as in the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable sized expression matrices for overall gene-shaving, orthogonalization and bootstrapping under the serial, 2-node and 16-node environments; computation time (seconds) against size of expression matrix (rows).]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes = 1301.18 / 825.34 = 1.6

    S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It includes also descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
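By way of illustration, the calls below exercise the facilities just described; the search strings are arbitrary examples chosen here, not taken from the column:

?help                        # documentation on the topic 'help'
help("lm", htmlhelp = TRUE)  # a help page rendered as HTML
help.search("regression")    # search names, titles and keywords
apropos("lm")                # objects whose names partially match "lm"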

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing. An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
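To make the point concrete, a minimal hypothetical pattern (not taken from the book's scripts) for keeping the workspace clean looks like this; the data frame used is R's built-in 'cars', purely as an example:

attach(cars)     # make the columns visible by name, as the book's scripts do
mean(speed)      # work with the variables directly
detach("cars")   # detach again to avoid masking and clutter later on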

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, the book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book I was a bit surprised to see the last four chapters: I don't think most people expect to see those topics covered in an introductory text. I do agree with the author, though, that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
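As a minimal sketch (not from the book) of the deparse(substitute()) idiom mentioned above:

    myplot <- function(x, main = deparse(substitute(x)), ...) {
        # 'main' defaults to the expression the caller supplied for 'x'
        plot(x, main = main, ...)
    }
    myplot(rnorm(10))    # the plot title reads "rnorm(10)"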

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics so briefly without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models, the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See Startup and options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
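A small sketch of how this can be used (the version string is only an example):

    RNGversion("1.6.2")   # use the generators that were the default in R 1.6.2
    set.seed(1)
    runif(2)              # reproduces pre-1.7.0 simulation results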

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
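For example:

    x <- list(1, 2, 3, list("a", "b"))
    x[[c(4, 2)]]    # same as x[[4]][[2]], i.e. "b"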

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() in mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.
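For instance, a compressed workspace can now be loaded directly from a connection such as a URL (the address below is purely illustrative):

    load(url("http://example.org/mydata.RData"))  # hypothetical location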

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
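A small illustration:

    mapply(rep, 1:3, 3:1)
    # equivalent to list(rep(1, 3), rep(2, 2), rep(3, 1))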

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() in package eda now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.

• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
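For example, two random 2 x 2 tables with row sums (10, 20) and column sums (15, 15):

    r2dtable(2, c(10, 20), c(15, 15))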

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling 'R -g Tk' (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack: this package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: a graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava: collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind: combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm: the package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact: the package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: functions for modelling dispersion in GLMs. By Luca Scrucca.

effects: graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: this package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: a maximum likelihood approach to mixed models. By Göran Broström.

gpclib: general polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit: local regression, likelihood and density estimation. By Catherine Loader.

mimR: an R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim: multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp: a collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: this package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session: utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: an abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: a collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: a collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: this provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (56)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (75)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (48)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (69)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (33)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.

The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.

Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R^{*2}, and returns this to the master process. The master averages the R^{*2}_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.

par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}

After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

  rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of all R^2 values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:

# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)

.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.

Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. The plot shows computation time (seconds) against the size of the expression matrix (rows) for overall gene-shaving, bootstrapping and orthogonalization.]

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
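As a quick check, these estimates can be reproduced directly in R. The following one-liner is an illustrative sketch (the function name amdahl is ours, not from the original article):

  # Amdahl's law: speedup on n processors when a fraction P of
  # the work can be run in parallel
  amdahl <- function(n, P = 0.78) 1 / ((1 - P) + P / n)
  amdahl(c(2, 10))  # roughly 1.64 and 3.36, matching the values
                    # quoted above up to rounding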

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as in the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable-sized expression matrices, for the bootstrapping, the orthogonalization and the overall algorithm, under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows).

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different levels of R related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
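As a small illustration of the facilities described so far (the topics searched for below are arbitrary examples, not taken from the article):

  ?mean                          # documentation for a known topic
  help("mean", htmlhelp = TRUE)  # the same page, rendered as HTML
  help.search("quantile")        # search names, titles and keywords
  apropos("mean")                # objects whose names partially match "mean"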

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 and 8 extend the introductory Chapters 1-3 with material on drawing graphics and on S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from its target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface for creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid omitted in this text version; see the URL above for the grid.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:

par.orth <- function(IPx, X) {
    rows <- dim(X)[1]
    tasks <- length(orth.children)
    # initialize the send buffer
    .PVM.initsend()
    # pack the IPx matrix and send it to
    # all children
    .PVM.pkdblmat(IPx)
    .PVM.mcast(orth.children, 0)
    # divide the X matrix into blocks and
    # send a block to each child
    nrows <- as.integer(rows / tasks)
    for (i in 1:tasks) {
        start <- (nrows * (i - 1)) + 1
        end <- nrows * i
        if (end > rows)
            end <- rows
        Xpart <- X[start:end, ]
        .PVM.pkdblmat(Xpart)
        .PVM.send(orth.children[i], start)
    }
    # wait for the blocks to be returned
    # and re-construct the matrix
    for (i in 1:tasks) {
        .PVM.recv(orth.children[i], 0)
        rZ <- .PVM.upkdblmat()
        if (i == 1) {
            Z <- rZ
        } else {
            Z <- rbind(Z, rZ)
        }
    }
    dimnames(Z) <- dimnames(X)
    Z
}

The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:

# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
    Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)

Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these take a significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.

[Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows) for the full gene-shaving run, the bootstrapping step and the orthogonalization step. Plot omitted in this text version.]

Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
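As a quick cross-check of these figures, Amdahl's law for a job whose parallelizable fraction is P, run on n processors, bounds the speedup by 1/((1 - P) + P/n). A minimal R sketch (the helper name amdahl is ours, not from the article):

# Expected maximum speedup under Amdahl's law for a job whose
# parallelizable fraction is P, run on n processors.
amdahl <- function(P, n) 1 / ((1 - P) + P / n)

amdahl(0.78, c(2, 10))
# [1] 1.639344 3.355705   (roughly the 1.63 and 3.35 quoted above)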

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and the runs using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

[Figure 4: Comparison of computation times for variable sized expression matrices, for the overall run, the orthogonalization and the bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows). Plots omitted in this text version.]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes  = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A. and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D. and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L. and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au

R Help Desk

Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN [1]. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well [2].

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

[1] http://CRAN.R-project.org
[2] http://CRAN.R-project.org/faqs.html

Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
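For example, a minimal interactive sketch (the help topic mean is an arbitrary choice):

?mean                      # same as help("mean"), shown as formatted text
help("mean", htmlhelp = TRUE)   # HTML help for a single call, if available
options(htmlhelp = TRUE)        # or switch the whole session to HTML help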

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
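A corresponding sketch of these search helpers (the search strings are arbitrary examples):

help.search("kruskal")   # fuzzy matching of names, titles and keywords
apropos("^qr\\.")        # objects on the search path, here via a regular expression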

Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search [3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists [4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list [5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

[3] http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R was itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
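A tiny illustration of the new indexing (the example list is ours):

x <- list(1, "a", TRUE, list(10, 20))
x[[c(4, 2)]]    # 20, equivalent to x[[4]][[2]]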

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e. works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
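For example (a small illustrative call, not taken from the release notes):

mapply(rep, 1:4, 4:1)
# returns list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))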

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g. on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include computing allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc., of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid itself (cells numbered 1-27) is not reproduced here.]

Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.

Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.

Figure 4: Comparison of computation times for variable sized expression matrices, for bootstrapping, orthogonalization and the overall algorithm, under the different environments (serial, 2 nodes, 16 nodes). [The three panels plot computation time in seconds against the size of the expression matrix in rows; the figure itself is not reproduced here.]

The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

  S = T_serial / T_parallel

  S_2nodes  = 13011.8 / 8253.4 = 1.6
  S_16nodes = 13011.8 / 5495.9 = 2.4

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization, here it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
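The arithmetic behind these figures can be checked with a few lines of R (a sketch added here, assuming the serial, 2-node and 16-node timings of roughly 13011.8, 8253.4 and 5495.9 seconds implied by the ratios above):

  # Amdahl's law: expected speedup for a parallel fraction p on n processors
  amdahl <- function(p, n) 1 / ((1 - p) + p / n)
  amdahl(0.78, c(2, 10))            # approximately 1.63 and 3.35
  c(two     = 13011.8 / 8253.4,     # observed speedup on 2 nodes
    sixteen = 13011.8 / 5495.9)     # observed speedup on 16 nodes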

Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.

Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.


Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
[email protected]
[email protected]
[email protected]

R Help Desk

Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN [1]. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well [2].

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

[1] http://CRAN.R-project.org/
[2] http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
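As a short illustration (commands added here; they are not part of the original column):

  help("predict")            # formatted text help, the same as ?predict
  options(htmlhelp = TRUE)   # use HTML help for the rest of the session
  help("predict")            # now rendered as HTML, with links, where available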

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
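For example (added here for illustration):

  help.search("median")   # search names, titles and keywords of installed docs
  apropos("med")          # objects on the search path with 'med' in their names
  apropos("^is\\.")       # apropos() also accepts regular expressions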

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search [3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists [4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list [5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.

[3] http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] the archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See the Startup and options help pages for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
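A small example of the difference (added here for illustration; the class strings shown are those of R 1.7.x):

  m <- matrix(1:4, nrow = 2)
  class(m)       # "matrix": an implicit class, although no class attribute is set
  oldClass(m)    # NULL: the underlying class attribute itself
  oldClass(m) <- "myclass"
  class(m)       # now "myclass"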

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new function RNGversion() allows you to restore the generators of an earlier R version if reproducibility is required.
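For instance (an illustration added here):

  RNGkind()             # "Mersenne-Twister" "Inversion" under the new defaults
  RNGversion("1.6.2")   # switch back to the generators used by R 1.6.2
  RNGkind()             # the old kinds, for reproducing earlier simulations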

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
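For example (added here for illustration):

  x <- list(a = 1, b = list(c = 2, d = 3))
  x[[c(2, 2)]]   # the same as x[[2]][[2]], i.e. 3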

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
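A quick illustration (added here):

  # collect the printed summary of a regression as a character vector
  out <- capture.output(summary(lm(dist ~ speed, data = cars)))
  length(out)   # one element per printed line
  out[1:3]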

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts, by default, in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

bull glm() now accepts lsquoetastartrsquo and lsquomustartrsquo asalternative ways to express starting values

bull New function gzcon() which wraps a connec-tion and provides (de)compression compatiblewith gzip

load() now uses gzcon() so can read com-pressed saves from suitable connections

bull helpsearch() can now reliably match indi-vidual aliases and keywords provided that allpackages searched were installed using R 170or newer

bull histdefault() now returns the nominalbreak points not those adjusted for numericaltolerances

To guard against unthinking use lsquoin-cludelowestrsquo in histdefault() is now ig-nored with a warning unless lsquobreaksrsquo is a vec-tor (It either generated an error or had no ef-fect depending how prettification of the rangeoperated)

bull New generic functions influence()hatvalues() and dfbeta() with lm and glmmethods the previously normal functionsrstudent() rstandard() cooksdistance()and dfbetas() became generic These havechanged behavior for glm objects ndash all originat-ing from John Foxrsquo car package


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply() (example after this list).

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method (currently identical to print.default()). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument, 'character.only', to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try() (example after this list).

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency (example after this list).

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
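The entries above are terse, so here are a few minimal sketches (not part of the original change list) illustrating some of the new behaviour; they assume R 1.7.0 or later and use only made-up toy data:

  ## names that are too short are padded with NA
  x <- 1:3
  names(x) <- "a"
  names(x)                      # "a" NA NA

  ## recursive indexing of lists
  lst <- list(1, 2, 3, list("a", "b"))
  lst[[c(4, 2)]]                # same as lst[[4]][[2]], i.e. "b"

  ## capture printed output as a character vector
  out <- capture.output(summary(cars))
  out[1:2]

  ## mapply() as a multivariate lapply()
  mapply(rep, 1:3, 3:1)         # list(rep(1, 3), rep(2, 2), rep(3, 1))

  ## try() without the error message
  res <- try(stop("not available"), silent = TRUE)
  inherits(res, "try-error")    # TRUE

  ## start values beyond the frequency are now accepted by ts()
  ts(1:24, start = c(1998, 13), frequency = 12)   # starts in January 1999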

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava: Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact: The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLM. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium, ... By Gregory Warnes and Friedrich Leisch.

glmmML: A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit: Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim: Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.


session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid (numbered cells 1 to 27) cannot be reproduced here; see the web page above for the grid.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



R Help Desk
Getting Help – R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.

Manuals

The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.

For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).

Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way for getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
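As a small illustration of these access paths (not part of the original column; the help topic chosen here is arbitrary):

  ?lm                        # same as help("lm")
  options(htmlhelp = TRUE)   # switch to HTML help for the rest of the session
  help("lm")                 # now displayed in the browser, if available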

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
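For example (a rough sketch, with search terms picked only for illustration):

  help.search("linear model")   # search names, titles and keywords of installed packages
  apropos("med")                # objects on the search path whose names contain "med"
  apropos("^qr\\.")             # regular expressions work too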

Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.

Mailing lists

Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html


Bibliography

Hornik, K. (2003). The R FAQ, version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
[email protected]

Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/

The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
[email protected]

Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.

The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.

Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
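To illustrate the deparse(substitute()) idiom mentioned in passing above (myplot() is a made-up example, not a function from the book):

  myplot <- function(x, main = deparse(substitute(x))) plot(x, main = main)
  myplot(rnorm(10))   # the plot title becomes the call text "rnorm(10)"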

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
[email protected]

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The ‘methods’, ‘modreg’, ‘mva’, ‘nls’ and ‘ts’ packages are now attached by default at startup (in addition to ‘ctest’). The option "defaultPackages" has been added which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when ‘methods’ is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without ‘methods’ used to.

• The default random number generators have been changed to ‘Mersenne-Twister’ and ‘Inversion’. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

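As a small illustration (an example sketch added here, not part of the original change list; the version string is just an example):

> RNGversion("1.6.2")   # switch back to the generators used by R 1.6.2
> set.seed(42)
> runif(2)              # reproduces values generated under the old defaults
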
• Namespaces can now be defined for packages other than ‘base’: see ‘Writing R Extensions’. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages ‘KernSmooth’, ‘MASS’, ‘boot’, ‘class’, ‘nnet’, ‘rpart’ and ‘spatial’.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

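A minimal sketch of the padding behaviour (illustrative example, not from the original item):

> x <- 1:3
> names(x) <- c("a", "b")   # value too short: padded with a character NA
> names(x)
[1] "a" "b" NA
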
• It is now recommended to use the ‘SystemRequirements’ field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

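A quick sketch of the new recursive indexing (example data made up for illustration):

> x <- list(1, 2, 3, list("a", "b"))
> x[[c(4, 2)]]          # the same as x[[4]][[2]]
[1] "b"
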
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows ‘xlab’ and ‘ylab’ parameters to be set (without partially matching to ‘xlabs’ and ‘ylabs’). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

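For instance, a small sketch of capturing printed output as a character vector (output not shown):

> out <- capture.output(summary(rnorm(100)))
> is.character(out)     # the printed summary, one string per line
[1] TRUE
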
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and ‘lm’ method (formerly in package MASS, which has ‘glm’ and ‘nls’ methods).

• New function constrOptim() for optimisation under linear inequality constraints.

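A minimal sketch with a made-up objective: minimise sum((x - 2)^2) subject to x1 + x2 <= 1, written in the ui %*% theta - ci >= 0 form expected by constrOptim():

fr <- function(x) sum((x - 2)^2)    # objective
gr <- function(x) 2 * (x - 2)       # its gradient
ui <- matrix(c(-1, -1), nrow = 1)   # -x1 - x2 >= -1, i.e. x1 + x2 <= 1
res <- constrOptim(c(0, 0), fr, gr, ui = ui, ci = -1)
res$par                             # approximately c(0.5, 0.5)
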
• Add ‘difftime’ subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require ‘basic’ username/password authentication.

• dump() has a new argument ‘envir’. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument ‘local’ for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

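A typical use is inside a "function factory", where a lazily evaluated argument should be fixed at creation time (illustrative sketch):

make.pow <- function(k) {
  force(k)              # evaluate k now, not when the result is first called
  function(x) x^k
}
square <- make.pow(2)
square(4)               # 16
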
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts ‘etastart’ and ‘mustart’ as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, ‘include.lowest’ in hist.default() is now ignored, with a warning, unless ‘breaks’ is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox’ car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression (‘modreg’ package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new ‘plot’ argument. Setting it ‘FALSE’ gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument ‘recursive’.

• lm.influence() has a new ‘do.coef’ argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

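Two quick illustrations (sketches, not from the original item):

> mapply(function(x, y) x + y, 1:3, c(10, 20, 30))
[1] 11 22 33
> mapply(rep, 1:3, 3:1)   # rep(1, 3), rep(2, 2), rep(3, 1)
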
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

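For example (a sketch using a temporary file, so it runs anywhere):

> library(tools)
> f <- tempfile()
> writeLines("some text", f)
> md5sum(f)             # named character vector of checksums
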
• medpolish() (package eda) now has an ‘na.rm’ argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for ‘main’, and supports the ‘las’ argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the ‘gr’ argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the ‘pdfviewer’ option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() (‘mva’ package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a ‘strict’ argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

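For instance (numbers chosen purely for illustration):

> power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05, strict = TRUE)
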
• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument ‘formula’, so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with ‘filename’ as ‘NA’, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument ‘datax’ to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield’s algorithm.

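For example, two random 3 × 2 tables with row sums (5, 5, 10) and column sums (12, 8) (a sketch; the output is random):

> r2dtable(2, c(5, 5, 10), c(12, 8))
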
• rchisq() now has a non-centrality parameter ‘ncp’, and there’s a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library.

• New functions rmultinom() and dmultinom(), the first one with a C API.

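A short sketch of both functions (the draws are random):

> rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))   # three multinomial draws
> dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))     # probability of one outcome
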
• New function runmed() for fast running medians (‘modreg’ package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new ‘right’ argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument ‘abbr.colnames’.

• summary(<logical>) now mentions NA’s, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument ‘simplify’ is true.

• Added [ method for terms objects.

• New argument ‘silent’ to try().

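A small sketch of the new argument:

> res <- try(log("a"), silent = TRUE)   # error message is not printed
> inherits(res, "try-error")
[1] TRUE
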
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be ‘sealed’, by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu’s frt and Compaq’s fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option ‘--with-lapack’ to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options ‘--with-libpng’, ‘--with-jpeglib’, ‘--with-zlib’, ‘--with-bzlib’ and ‘--with-pcre’, principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing ‘INDEX’ file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the ‘demo/00Index’ file, and the VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the ‘Meta’ subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The ‘CONTENTS’ file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using ‘--save’ signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator ‘_’ will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of ‘_’ becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments ‘pkg’ and ‘lib’ of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class ‘slides’.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R’s can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see ‘Writing R Extensions’ for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire’s AdaBoost algorithm and J. Friedman’s gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta’s C library. By R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in ‘An Introduction to Statistical Modeling of Extreme Values’ by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book “Visualizing Categorical Data” by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk’s event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are ‘cryptic’ in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[Crossword grid (numbered cells 1–27) not reproduced here.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can’t be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna’s beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: “At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting.”

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer’s Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing, communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


Page 27: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 27

2003d) is the appropriate reference It includes alsodescriptions on interfacing to C C++ or Fortrancode and compiling and linking this code into R

Occasionally library(help=PackageName) indi-cates that package specific manuals (or papers corre-sponding to packages) so-called vignettes are avail-able in directory lsquolibraryPackageNamedocrsquo

Beside the manuals several books dealing with Rare available these days eg introductory textbooksbooks specialized on certain applications such as re-gression analysis and books focussed on program-ming and the language Here I do not want to adver-tise any of those books because the ldquooptimalrdquo choiceof a book depends on the tasks a user is going toperform in R Nevertheless I would like to mentionVenables and Ripley (2000) as a reference for pro-gramming and the language and Venables and Rip-ley (2002) as a comprehensive book dealing with sev-eral statistical applications and how to apply those inR (and S-PLUS)

Help functions

Access to the documentation on the topic help (in thiscase a function name) is provided by help The ldquordquois the probably most frequently used way for gettinghelp whereas the corresponding function help() israther flexible As the default help is shown as for-matted text This can be changed (by argumentsto help() or for a whole session with options())to eg HTML (htmlhelp=TRUE) if available TheHTML format provides convenient features like linksto other related help topics On Windows I pre-fer to get help in Compiled HTML format by settingoptions(chmhelp=TRUE) in file lsquoRprofilersquo

When exact names of functions or other top-ics are unknown the user can search for rele-vant documentation (matching a given characterstring in eg name title or keyword entries) withhelpsearch() This function allows quite flex-ible searching using either regular expression orfuzzy matching see helpsearch for details Ifit is required to find an object by its partial nameapropos() is the right choice ndash it also accepts regularexpressions

Documentation and help in HTML format canalso be accessed by helpstart() This functionsstarts a Web browser with an index that links toHTML versions of the manuals mentioned above theFAQs a list of available packages of the current in-stallation several other information and a page con-taining a search engine and keywords On the lat-ter page the search engine can be used to ldquosearchfor keywords function and data names and text inhelp page titlesrdquo of installed packages The list of

keywords is useful when the search using other fa-cilities fails the user gets a complete list of functionsand data sets corresponding to a particular keyword

Of course it is not possible to search for func-tionality of (unknown) contributed packages or doc-umentation packages (and functions or data sets de-fined therein) that are not available in the current in-stallation of R A listing of contributed packages andtheir titles is given on CRAN as well as their func-tion indices and reference manuals Nevertheless itmight be still hard to find the function or package ofinterest For that purpose an R site search3 providedby Jonathan Baron is linked on the search page ofCRAN Help files manuals and mailing list archives(see below) can be searched

Mailing lists

Sometimes manuals and help functions do not an-swer a question or there is need for a discussion Forthese purposes there are the mailing lists4 r-announcefor important announcements r-help for questionsand answers about problems and solutions using Rand r-devel for the development of R

Obviously r-help is the appropriate list for ask-ing questions that are not covered in the manuals(including the ldquoR FAQrdquo) or the help pages and reg-ularly those questions are answered within a fewhours (or even minutes)

Before asking it is a good idea to look into thearchives of the mailing list5 These archives con-tain a huge amount of information and may be muchmore helpful than a direct answer on a question be-cause most questions already have been asked ndash andanswered ndash several times (focussing on slightly dif-ferent aspects hence these are illuminating a biggercontext)

It is really easy to ask questions on r-help butI would like to point out the advantage of readingmanuals before asking Of course reading (a sectionof) some manual takes much more time than ask-ing and answers from voluntary help providers ofr-help will probably solve the recent problem of theasker but quite similar problems will certainly arisepretty soon Instead the time one spends readingthe manuals will be saved in subsequent program-ming tasks because many side aspects (eg conve-nient and powerful functions programming tricksetc) will be learned from which one benefits in otherapplications

3httpfinzipsychupennedusearchhtml see also httpCRANR-projectorgsearchhtml4accessible via httpwwwR-projectorgmailhtml5archives are searchable see httpCRANR-projectorgsearchhtml

R News ISSN 1609-3631

Vol 31 June 2003 28

Bibliography

Hornik K (2003) The R FAQ Version 17-3 httpwwwcituwienacat~hornikR ISBN 3-901167-51-X 26

R Development Core Team (2003a) R Data Im-portExport URL httpCRANR-projectorgmanualshtml ISBN 3-901167-53-6 26

R Development Core Team (2003b) R Installation andAdministration URL httpCRANR-projectorgmanualshtml ISBN 3-901167-52-8 26

R Development Core Team (2003c) R LanguageDefinition URL httpCRANR-projectorgmanualshtml ISBN 3-901167-56-0 26

R Development Core Team (2003d) Writing RExtensions URL httpCRANR-projectorgmanualshtml ISBN 3-901167-54-4 26

Venables W N and Ripley B D (2000) S Program-ming New York Springer-Verlag 27

Venables W N and Ripley B D (2002) Modern Ap-plied Statistics with S New York Springer-Verlag4th edition 27

Venables W N Smith D M and the R De-velopment Core Team (2003) An Introduction toR URL httpCRANR-projectorgmanualshtml ISBN 3-901167-55-2 26

Uwe LiggesFachbereich Statistik Universitaumlt Dortmund Germanyliggesstatistikuni-dortmundde

Book ReviewsMichael Crawley Statistical Com-puting An Introduction to DataAnalysis Using S-Plus

John Wiley and Sons New York USA 2002770 pages ISBN 0-471-56040-5httpwwwbioicacukresearchmjcrawstatcomp

The number of monographs related to the Sprogramming language is sufficiently small (a fewdozen at most) that there are still great opportuni-ties for new books to fill some voids in the literatureStatistical Computing (hereafter referred to as SC) isone such text presenting an array of modern statisti-cal methods in S-Plus at an introductory level

The approach taken by author is to emphasizegraphical data inspection parameter estimation andmodel criticism rather than classical hypothesis test-ing (p ix) Within this framework the text coversa vast range of applied data analysis topics Back-ground material on S-Plus probability and distri-butions is included in the text The traditional top-ics such as linear models regression and analysis ofvariance are covered sometimes to a greater extentthan is common (for example a split-split-split-splitplot experiment and a whole chapter on ANCOVA)More advanced modeling techniques such as boot-strap generalized linear models mixed effects mod-els and spatial statistics are also presented For eachtopic some elementary theory is presented and usu-ally an example is worked by hand Crawley alsodiscusses more philosophical issues such as random-ization replication model parsimony appropriatetransformations etc The book contains a very thor-

ough index of more than 25 pages In the index allS-Plus language words appear in bold type

A supporting web site contains all the data filesscripts with all the code from each chapter of thebook and a few additional (rough) chapters Thescripts pre-suppose that the data sets have alreadybeen imported into S-Plus by some mechanismndashnoexplanation of how to do this is givenndashpresentingthe novice S-Plus user with a minor obstacle Oncethe data is loaded the scripts can be used to re-dothe analyses in the book Nearly every worked ex-ample begins by attaching a data frame and thenworking with the variable names of the data frameNo detach command is present in the scripts so thework environment can become cluttered and evencause errors if multiple data frames have commoncolumn names Avoid this problem by quitting theS-Plus (or R) session after working through the scriptfor each chapter

SC contains a very broad use of S-Plus commandsto the point the author says about subplot Hav-ing gone to all this trouble I canrsquot actually see whyyou would ever want to do this in earnest (p 158)Chapter 2 presents a number of functions (grepregexpr solve sapply) that are not used within thetext but are usefully included for reference

Readers of this newsletter will undoubtedly wantto know how well the book can be used with R (ver-sion 162 base with recommended packages) Ac-cording to the preface of SC ldquoThe computing is pre-sented in S-Plus but all the examples will also workin the freeware program called Rrdquo Unfortunatelythis claim is too strong The bookrsquos web site doescontain (as of March 2003) three modifications forusing R with the book Many more modifications

R News ISSN 1609-3631

Vol 31 June 2003 29

are needed A rough estimate would be that 90of the S-Plus code in the scripts needs no modifica-tion to work in R Basic calculations and data ma-nipulations are nearly identical in S-Plus and R Inother cases a slight change is necessary (ksgof in S-Plus kstest in R) alternate functions must be used(multicomp in S-Plus TukeyHSD in R) andor addi-tional packages (nlme) must be loaded into R Othertopics (bootstrapping trees) will require more exten-sive code modification efforts and some functionalityis not available in R (varcomp subplot)

Because of these differences and the minimal doc-umentation for R within the text users inexperiencedwith an S language are likely to find the incompati-bilities confusing especially for self-study of the textInstructors wishing to use the text with R for teach-ing should not find it too difficult to help studentswith the differences in the dialects of S A completeset of online complements for using R with SC wouldbe a useful addition to the bookrsquos web site

In summary SC is a nice addition to the litera-ture of S The breadth of topics is similar to that ofModern Applied Statistics with S (Venables and Rip-ley Springer 2002) but at a more introductory levelWith some modification the book can be used withR The book is likely to be especially useful as anintroduction to a wide variety of data analysis tech-niques

Kevin WrightPioneer Hi-Bred InternationalKevinWrightpioneercom

Peter DalgaardIntroductory Statistics with R

Springer Heidelberg Germany 2002282 pages ISBN 0-387-95475-9httpwwwbiostatkudk~pdISwRhtml

This is the first book other than the manuals thatare shipped with the R distribution that is solely de-voted to R (and its use for statistical analysis) Theauthor would be familiar to the subscribers of the Rmailing lists Prof Dalgaard is a member of R Coreand has been providing people on the lists with in-sightful elegant solutions to their problems Theauthorrsquos R wizardry is demonstrated in the manyvaluable R tips and tricks sprinkled throughout thebook although care is taken to not use too-cleverconstructs that are likely to leave novices bewildered

As R is itself originally written as a teaching tooland many are using it as such (including the author)I am sure its publication is to be welcomed by manyThe targeted audience of the book is ldquononstatisticianscientists in various fields and students of statisticsrdquoIt is based upon a set of notes the author developedfor the course Basic Statistics for Health Researchers

at the University of Copenhagen A couple of caveatswere stated in the Preface Given the origin of thebook the examples used in the book were mostly ifnot all from health sciences The author also madeit clear that the book is not intended to be used as astand alone text For teaching another standard in-troductory text should be used as a supplement asmany key concepts and definitions are only brieflydescribed For scientists who have had some statis-tics however this book itself may be sufficient forself-study

The titles of the chapters are1 Basics2 Probability and distributions3 Descriptive statistics and graphics4 One- and two-sample tests5 Regression and correlation6 ANOVA and Kruskal-Wallis7 Tabular data8 Power and computation of sample sizes9 Multiple regression

10 Linear Models11 Logistic regression12 Survival analysis

plus three appendices Obtaining and installing RData sets in the ISwR package (help files for data setsthe package) and Compendium (short summary ofthe R language) Each chapter ends with a few ex-ercises involving further analysis of example dataWhen I first opened the book I was a bit surprised tosee the last four chapters I donrsquot think most peopleexpected to see those topics covered in an introduc-tory text I do agree with the author that these topicsare quite essential for data analysis tasks for practicalresearch

Chapter one covers the basics of the R languageas much as needed in interactive use for most dataanalysis tasks People with some experience withS-PLUS but new to R will soon appreciate manynuggets of features that make R easier to use (egsome vectorized graphical parameters such as colpch etc the functions subset transform amongothers) As much programming as is typicallyneeded for interactive use (and may be a little moreeg the use of deparse(substitute() as defaultargument for the title of a plot) is also described

Chapter two covers the calculation of probabil-ity distributions and random number generation forsimulations in R Chapter three describes descrip-tive statistics (mean SD quantiles summary tablesetc) as well as how to graphically display summaryplots such as Q-Q plots empirical CDFs boxplotsdotcharts bar plots etc Chapter four describes the tand Wilcoxon tests for one- and two-sample compar-ison as well as paired data and the F test for equalityof variances

Chapter five covers simple linear regression andbivariate correlations Chapter six contains quite abit of material from oneway ANOVA pairwise com-

R News ISSN 1609-3631

Vol 31 June 2003 30

parisons Welchrsquos modified F test Barlettrsquos test forequal variances to twoway ANOVA and even a re-peated measures example Nonparametric proce-dures are also described Chapter seven covers basicinferences for categorical data (tests for proportionscontingency tables etc) Chapter eight covers powerand sample size calculations for means and propor-tions in one- and two-sample problems

The last four chapters provide a very nice briefsummary on the more advanced topics The graphi-cal presentations of regression diagnostics (eg colorcoding the magnitudes of the diagnostic measures)can be very useful However I would have pre-ferred the omission of coverage on variable selectionin multiple linear regression entirely The coverageon these topics is necessarily brief but do provideenough material for the reader to follow through theexamples It is impressive to cover this many topicsin such brevity without sacrificing clarity of the de-scriptions (The linear models chapter covers poly-nomial regression ANCOVA regression diagnosticsamong others)

As with (almost) all first editions typos are in-evitable (eg on page 33 in the modification to vec-torize Newtonrsquos method for calculating square rootsany should have been all on page 191 the logitshould be defined as log[p(1 minus p)]) On the wholehowever this is a very well written book with log-ical ordering of the content and a nice flow Amplecode examples are interspersed throughout the textmaking it easy for the reader to follow along on acomputer I whole-heartedly recommend this bookto people in the ldquointended audiencerdquo population

Andy LiawMerck Research Laboratoriesandy_liawmerckcom

John Fox An R and S-Plus Com-panion to Applied Regression

Sage Publications Thousand Oaks CA USA 2002328 pages ISBN 0-7619-2280-6httpwwwsocscimcmastercajfoxBookscompanion

The aim of this book is to enable people withknowledge on linear regression modelling to analyzetheir data using R or S-Plus Any textbook on linearregression may provide the necessary backgroundof course both language and contents of this com-panion are closely aligned with the authors own Ap-plied Regression Linear Models and Related Metods (FoxSage Publications 1997)

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.


The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0

by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
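
For example (an illustrative sketch added here, not part of the original announcement; the version strings are arbitrary):

    RNGversion("1.6.2")           # restore the default generators of R 1.6.2
    set.seed(1); old <- runif(2)
    RNGversion("1.7.0")           # back to the new Mersenne-Twister / Inversion defaults
    set.seed(1); new <- runif(2)  # generally differs from 'old'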

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
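
For instance (a small sketch, not from the NEWS text):

    x <- 1:3
    names(x) <- c("a", "b")   # the missing third name is padded with NA
    names(x)                  # "a" "b" NA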

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
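
A brief sketch of what this allows (not part of the original entry):

    d <- data.frame(x = c(-1, 2), y = c(3, -4))
    d[d < 0] <- 0             # logical matrix index on the left-hand side
    d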

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
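
For example (an illustrative sketch):

    x <- list(1, 2, 3, list("a", "b"))
    x[[c(4, 2)]]              # same as x[[4]][[2]], i.e. "b"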

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
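
A short sketch using the built-in 'lh' series (the model is arbitrary):

    fit <- arima(lh, order = c(1, 0, 0))
    coef(fit)
    AIC(fit)                  # uses the new logLik() method
    vcov(fit)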

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.
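
For example:

    basename(c("/tmp/a/x.R", "/tmp/b/y.R"))   # "x.R" "y.R"
    dirname(c("/tmp/a/x.R", "/tmp/b/y.R"))    # "/tmp/a" "/tmp/b"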

• biplot.default() (mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
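
For example (a sketch, not from the original entry):

    out <- capture.output(summary(lm(dist ~ speed, data = cars)))
    is.character(out)         # TRUE: the printed summary, stored line by line
    out[1:4]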

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
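
A minimal sketch (using the built-in 'cars' data):

    fit <- lm(dist ~ speed, data = cars)
    confint(fit)              # 95% intervals for the coefficients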

• New function constrOptim() for optimisation under linear inequality constraints.
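
As an illustrative sketch (the objective and constraints are made up): minimise a quadratic subject to x1 >= 1 and x2 >= 1, with the constraints written as ui %*% theta - ci >= 0.

    f <- function(x) sum((x - c(2, 0))^2)
    constrOptim(theta = c(2, 2), f = f, grad = NULL,
                ui = diag(2), ci = c(1, 1))$par   # approximately c(2, 1)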

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so can read compressed saves from suitable connections.
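
A minimal sketch (assumes the working directory is writable):

    x <- 1:10
    save(x, file = "x.RData", compress = TRUE)
    con <- gzcon(file("x.RData", "rb"))
    load(con)                 # reads the compressed save through the wrapped connection
    close(con)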

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
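
For example:

    mapply(rep, 1:3, 3:1)     # list(c(1, 1, 1), c(2, 2), 3)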

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization, by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
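
A rough sketch of the idea (entirely made up for illustration): a tiny travelling-salesman style search, where 'gr' proposes a neighbouring tour by swapping two cities.

    set.seed(1)
    xy <- matrix(runif(20), ncol = 2)        # 10 random city locations
    tourlen <- function(p) {                 # length of the closed tour p
      q <- c(p, p[1])
      sum(sqrt(rowSums((xy[q[-1], ] - xy[q[-11], ])^2)))
    }
    swap2 <- function(p, ...) {              # the 'move' function
      i <- sample(length(p), 2)
      p[rev(i)] <- p[i]
      p
    }
    optim(1:10, tourlen, gr = swap2, method = "SANN")$value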

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
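
For example (a sketch): three random 2 x 2 tables with row totals 10, 10 and column totals 12, 8.

    r2dtable(3, c(10, 10), c(12, 8))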

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (in package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw <http://www.fftw.org> and libjpeg <http://www.ijg.org>. By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid (clues 1–27) cannot be reproduced in this text version; see the URL above for the grid.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



The last four chapters provide a very nice briefsummary on the more advanced topics The graphi-cal presentations of regression diagnostics (eg colorcoding the magnitudes of the diagnostic measures)can be very useful However I would have pre-ferred the omission of coverage on variable selectionin multiple linear regression entirely The coverageon these topics is necessarily brief but do provideenough material for the reader to follow through theexamples It is impressive to cover this many topicsin such brevity without sacrificing clarity of the de-scriptions (The linear models chapter covers poly-nomial regression ANCOVA regression diagnosticsamong others)

As with (almost) all first editions typos are in-evitable (eg on page 33 in the modification to vec-torize Newtonrsquos method for calculating square rootsany should have been all on page 191 the logitshould be defined as log[p(1 minus p)]) On the wholehowever this is a very well written book with log-ical ordering of the content and a nice flow Amplecode examples are interspersed throughout the textmaking it easy for the reader to follow along on acomputer I whole-heartedly recommend this bookto people in the ldquointended audiencerdquo population

Andy LiawMerck Research Laboratoriesandy_liawmerckcom

John Fox An R and S-Plus Com-panion to Applied Regression

Sage Publications Thousand Oaks CA USA 2002328 pages ISBN 0-7619-2280-6httpwwwsocscimcmastercajfoxBookscompanion

The aim of this book is to enable people withknowledge on linear regression modelling to analyzetheir data using R or S-Plus Any textbook on linearregression may provide the necessary backgroundof course both language and contents of this com-panion are closely aligned with the authors own Ap-plied Regression Linear Models and Related Metods (FoxSage Publications 1997)

Chapters 1ndash2 give a basic introduction to R andS-Plus including a thorough treatment of data im-portexport and handling of missing values Chap-ter 3 complements the introduction with exploratory

data analysis and Box-Cox data transformations Thetarget audience of the first three sections are clearlynew users who might never have used R or S-Plusbefore

Chapters 4ndash6 form the main part of the book andshow how to apply linear models generalized linearmodels and regression diagnostics respectively Thisincludes detailed explanations of model formulaedummy-variable regression and contrasts for cate-gorical data and interaction effects Analysis of vari-ance is mostly explained using ldquoType IIrdquo tests as im-plemented by the Anova() function in package carThe infamous ldquoType IIIrdquo tests which are requestedfrequently on the r-help mailing list are providedby the same function but their usage is discouragedquite strongly in both book and help page

Chapter 5 reads as a natural extension of linearmodels and concentrates on error distributions pro-vided by the exponential family and correspondinglink functions The main focus is on models for cate-gorical responses and count data Much of the func-tionality provided by package car is on regression di-agnostics and has methods both for linear and gen-eralized linear models Heavy usage of graphicsidentify() and text-based menus for plots empha-size the interactive and exploratory side of regressionmodelling Other sections deal with Box-Cox trans-formations detection of non-linearities and variableselection

Finally Chapters 7 amp 8 extend the introductoryChapters 1ndash3 with material on drawing graphics andS programming Although the examples use regres-sion models eg plotting the surface of fitted val-ues of a generalized linear model using persp()both chapters are rather general and not ldquoregression-onlyrdquo

Several real world data sets are used throughoutthe book All S code necessary to reproduce numer-ical and graphical results is shown and explained indetail Package car which contains all data sets usedin the book and functions for ANOVA and regressiondiagnostics is available from CRAN and the bookrsquoshomepage

The style of presentation is rather informal andhands-on software usage is demonstrated using ex-amples without lengthy discussions of theory Overwide parts the text can be directly used in class todemonstrate regression modelling with S to students(after more theoretical lectures on the topic) How-ever the book contains enough ldquoremindersrdquo aboutthe theory such that it is not necessary to switch to amore theoretical text while reading it

The book is self-contained with respect to S andshould easily get new users started The placementof R before S-Plus in the title is not only lexicographicthe author uses R as the primary S engine Dif-ferences of S-Plus (as compared with R) are clearlymarked in framed boxes the graphical user interfaceof S-Plus is not covered at all

R News ISSN 1609-3631

Vol 31 June 2003 31

The only drawback of the book follows directlyfrom the target audience of possible S novices Moreexperienced S users will probably be disappointedby the ratio of S introduction to material on regres-sion analysis Only about 130 out of 300 pages dealdirectly with linear models the major part is an in-troduction to S The online appendix at the bookrsquoshomepage (mirrored on CRAN) contains several ex-tension chapters on more advanced topics like boot-

strapping time series nonlinear or robust regres-sion

In summary I highly recommend the book toanyone who wants to learn or teach applied regres-sion analysis with S

Friedrich LeischUniversitaumlt Wien AustriaFriedrichLeischR-projectorg

Changes in R 170by the R Core Team

User-visible changes

bull solve() chol() eigen() and svd() nowuse LAPACK routines unless a new back-compatibility option is turned on The signsand normalization of eigensingular vectorsmay change from earlier versions

bull The lsquomethodsrsquo lsquomodregrsquo lsquomvarsquo lsquonlsrsquo andlsquotsrsquo packages are now attached by default atstartup (in addition to lsquoctestrsquo) The optiondefaultPackages has been added whichcontains the initial list of packages SeeStartup and options for details Note thatFirst() is no longer used by R itself

class() now always (not just when lsquometh-odsrsquo is attached) gives a non-null class andUseMethod() always dispatches on the classthat class() returns This means that methodslike foomatrix and foointeger will be usedFunctions oldClass() and oldClasslt-() getand set the class attribute as R without lsquometh-odsrsquo used to

bull The default random number generators havebeen changed to lsquoMersenne-Twisterrsquo and lsquoIn-versionrsquo A new RNGversion() function allowsyou to restore the generators of an earlier R ver-sion if reproducibility is required

bull Namespaces can now be defined for packagesother than lsquobasersquo see lsquoWriting R ExtensionsrsquoThis hides some internal objects and changesthe search path from objects in a namespaceAll the base packages (except methods andtcltk) have namespaces as well as the rec-ommended packages lsquoKernSmoothrsquo lsquoMASSrsquolsquobootrsquo lsquoclassrsquo lsquonnetrsquo lsquorpartrsquo and lsquospatialrsquo

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

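For example, code like the following now triggers a warning that only the first element of the condition is used:

  x <- c(-1, 2)
  if (x > 0) cat("positive\n")   # warning: the condition has length > 1
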
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

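A small example of the new padding behaviour:

  x <- 1:4
  names(x) <- c("a", "b")   # two names for four elements
  names(x)                  # "a" "b" NA NA
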
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

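For instance, with a nested list (made up for illustration):

  lst <- list(1, 2, 3, list(letters, month.name))
  lst[[c(4, 2)]]      # same as lst[[4]][[2]], i.e. month.name
  lst[[c(4, 2, 5)]]   # one level deeper: "May"
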
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

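For example, to keep printed output as a character vector rather than writing it to the console (model chosen arbitrarily):

  out <- capture.output(summary(lm(dist ~ speed, data = cars)))
  length(out)                # one element per printed line
  cat(out[1:4], sep = "\n")  # re-print the first few lines
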
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

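A toy sketch (objective and constraints invented for illustration): minimise a quadratic subject to both coordinates being non-negative, starting from a strictly feasible point:

  f  <- function(p) sum((p - c(2, -1))^2)   # unconstrained minimum at (2, -1)
  gr <- function(p) 2 * (p - c(2, -1))
  ## feasible region: ui %*% p - ci >= 0, here simply p >= 0
  constrOptim(c(0.5, 0.5), f, gr, ui = diag(2), ci = c(0, 0))$par   # roughly (2, 0)
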
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

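In practice this means a compressed image written by save() can be loaded straight from a connection; the URL below is purely hypothetical:

  load(url("http://example.org/results.RData"))   # gzcon() is applied internally
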
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

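For example, applying a function to corresponding elements of several vectors:

  mapply(rep, 1:3, 3:1)               # rep(1,3), rep(2,2), rep(3,1), returned as a list
  mapply(seq, from = 1:3, to = 4:6)   # equal-length results are simplified to a matrix
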
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

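A hedged sketch of the idea (problem and move function made up): search over integer points by proposing a random unit step in one coordinate:

  f    <- function(p) sum((p - c(3, 7))^2)
  move <- function(p) {                 # candidate-generation function, passed as 'gr'
    i <- sample(length(p), 1)
    p[i] <- p[i] + sample(c(-1, 1), 1)
    p
  }
  optim(c(0, 0), f, gr = move, method = "SANN")$par
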
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

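For example, two random 2 x 2 tables with row sums 10, 10 and column sums 15, 5:

  r2dtable(2, c(10, 10), c(15, 5))
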
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

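For example:

  rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))   # three draws, each a column of counts summing to 10
  dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))     # probability of one particular split
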
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

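A short illustration; the two constructions differ only in the value taken at the jump points:

  x  <- 1:3
  f1 <- stepfun(x, 0:3)                 # default construction
  f2 <- stepfun(x, 0:3, right = TRUE)   # new in 1.7.0
  c(f1(2), f2(2))                       # values at a jump point differ
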
• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

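For example (directory chosen arbitrarily):

  tempfile("scratch")                 # a name in the per-session temporary directory, as before
  tempfile("scratch", tmpdir = ".")   # a name in the current working directory
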
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

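With silent = TRUE the error message is no longer printed, which is convenient inside loops; a minimal example:

  res <- try(stop("simulated failure"), silent = TRUE)
  if (inherits(res, "try-error")) res <- NA   # recover and continue
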
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw <http://www.fftw.org> and libjpeg <http://www.ijg.org>. By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R-news. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The numbered crossword grid is not reproduced in this text version; see the PDF or the URL above.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



Finally Chapters 7 amp 8 extend the introductoryChapters 1ndash3 with material on drawing graphics andS programming Although the examples use regres-sion models eg plotting the surface of fitted val-ues of a generalized linear model using persp()both chapters are rather general and not ldquoregression-onlyrdquo

Several real world data sets are used throughoutthe book All S code necessary to reproduce numer-ical and graphical results is shown and explained indetail Package car which contains all data sets usedin the book and functions for ANOVA and regressiondiagnostics is available from CRAN and the bookrsquoshomepage

The style of presentation is rather informal andhands-on software usage is demonstrated using ex-amples without lengthy discussions of theory Overwide parts the text can be directly used in class todemonstrate regression modelling with S to students(after more theoretical lectures on the topic) How-ever the book contains enough ldquoremindersrdquo aboutthe theory such that it is not necessary to switch to amore theoretical text while reading it

The book is self-contained with respect to S andshould easily get new users started The placementof R before S-Plus in the title is not only lexicographicthe author uses R as the primary S engine Dif-ferences of S-Plus (as compared with R) are clearlymarked in framed boxes the graphical user interfaceof S-Plus is not covered at all

R News ISSN 1609-3631

Vol 31 June 2003 31

The only drawback of the book follows directlyfrom the target audience of possible S novices Moreexperienced S users will probably be disappointedby the ratio of S introduction to material on regres-sion analysis Only about 130 out of 300 pages dealdirectly with linear models the major part is an in-troduction to S The online appendix at the bookrsquoshomepage (mirrored on CRAN) contains several ex-tension chapters on more advanced topics like boot-

strapping time series nonlinear or robust regres-sion

In summary I highly recommend the book toanyone who wants to learn or teach applied regres-sion analysis with S

Friedrich LeischUniversitaumlt Wien AustriaFriedrichLeischR-projectorg

Changes in R 170by the R Core Team

User-visible changes

bull solve() chol() eigen() and svd() nowuse LAPACK routines unless a new back-compatibility option is turned on The signsand normalization of eigensingular vectorsmay change from earlier versions

bull The lsquomethodsrsquo lsquomodregrsquo lsquomvarsquo lsquonlsrsquo andlsquotsrsquo packages are now attached by default atstartup (in addition to lsquoctestrsquo) The optiondefaultPackages has been added whichcontains the initial list of packages SeeStartup and options for details Note thatFirst() is no longer used by R itself

class() now always (not just when lsquometh-odsrsquo is attached) gives a non-null class andUseMethod() always dispatches on the classthat class() returns This means that methodslike foomatrix and foointeger will be usedFunctions oldClass() and oldClasslt-() getand set the class attribute as R without lsquometh-odsrsquo used to

bull The default random number generators havebeen changed to lsquoMersenne-Twisterrsquo and lsquoIn-versionrsquo A new RNGversion() function allowsyou to restore the generators of an earlier R ver-sion if reproducibility is required

bull Namespaces can now be defined for packagesother than lsquobasersquo see lsquoWriting R ExtensionsrsquoThis hides some internal objects and changesthe search path from objects in a namespaceAll the base packages (except methods andtcltk) have namespaces as well as the rec-ommended packages lsquoKernSmoothrsquo lsquoMASSrsquolsquobootrsquo lsquoclassrsquo lsquonnetrsquo lsquorpartrsquo and lsquospatialrsquo

bull Formulae are not longer automatically simpli-fied when terms() is called so the formulae inresults may still be in the original form ratherthan the equivalent simplified form (which

may have reordered the terms) the results arenow much closer to those of S

bull The tables for plotmath Hershey and Japanesehave been moved from the help pages(example(plotmath) etc) to demo(plotmath)etc

bull Errors and warnings are sent to stderr not std-out on command-line versions of R (Unix andWindows)

bull The R_X11 module is no longer loaded until itis needed so do test that x11() works in a newUnix-alike R installation

New features

bull if() and while() give a warning if called witha vector condition

bull Installed packages under Unix without com-piled code are no longer stamped with the plat-form and can be copied to other Unix-alikeplatforms (but not to other OSes because ofpotential problems with line endings and OS-specific help files)

bull The internal random number generators willnow never return values of 0 or 1 for runif()This might affect simulation output in ex-tremely rare cases Note that this is not guar-anteed for user-supplied random-number gen-erators nor when the standalone Rmath libraryis used

bull When assigning names to a vector a value thatis too short is padded by character NAs (Wish-list part of PR2358)

bull It is now recommended to use the rsquoSystemRe-quirementsrsquo field in the DESCRIPTION file forspecifying dependencies external to the R sys-tem

bull Output text connections no longer have a line-length limit

R News ISSN 1609-3631

Vol 31 June 2003 32

bull On platforms where vsnprintf does not returnthe needed buffer size the output line-lengthlimit for fifo() gzfile() and bzfile() hasbeen raised from 10k to 100k chars

bull The Math group generic does not check thenumber of arguments supplied before dis-patch it used to if the default method had oneargument but not if it had two This allowstruncPOSIXt() to be called via the groupgeneric trunc()

bull Logical matrix replacement indexing of dataframes is now implemented (interpreted as ifthe lhs was a matrix)

bull Recursive indexing of lists is allowed sox[[c(42)]] is shorthand for x[[4]][[2]] etc(Wishlist PR1588)

bull Most of the time series functions now check ex-plicitly for a numeric time series rather thanfail at a later stage

bull The postscript output makes use of relativemoves and so is somewhat more compact

bull and crossprod() for complex argumentsmake use of BLAS routines and so may bemuch faster on some platforms

bull arima() has coef() logLik() (and hence AIC)and vcov() methods

bull New function asdifftime() for time-intervaldata

bull basename() and dirname() are now vector-ized

bull biplotdefault() mva allows lsquoxlabrsquo andlsquoylabrsquo parameters to be set (without partiallymatching to lsquoxlabsrsquo and lsquoylabsrsquo) (Thanks toUwe Ligges)

bull New function captureoutput() to sendprinted output from an expression to a connec-tion or a text string

bull ccf() (pckage ts) now coerces its x and y argu-ments to class ts

bull chol() and chol2inv() now use LAPACK rou-tines by default

bull asdist() is now idempotent ie works fordist objects

bull Generic function confint() and lsquolmrsquo method(formerly in package MASS which has lsquoglmrsquoand lsquonlsrsquo methods)

bull New function constrOptim() for optimisationunder linear inequality constraints

bull Add lsquodifftimersquo subscript method and meth-ods for the group generics (Thereby fixingPR2345)

bull downloadfile() can now use HTTP proxieswhich require lsquobasicrsquo usernamepassword au-thentication

bull dump() has a new argument lsquoenvirrsquo The searchfor named objects now starts by default in theenvironment from which dump() is called

bull The editmatrix() and editdataframe()editors can now handle logical data

bull New argument lsquolocalrsquo for example() (sug-gested by Andy Liaw)

bull New function filesymlink() to create sym-bolic file links where supported by the OS

bull New generic function flush() with a methodto flush connections

bull New function force() to force evaluation of aformal argument

bull New functions getFromNamespace()fixInNamespace() and getS3method() to fa-cilitate developing code in packages withnamespaces

bull glm() now accepts lsquoetastartrsquo and lsquomustartrsquo asalternative ways to express starting values

bull New function gzcon() which wraps a connec-tion and provides (de)compression compatiblewith gzip

load() now uses gzcon() so can read com-pressed saves from suitable connections

bull helpsearch() can now reliably match indi-vidual aliases and keywords provided that allpackages searched were installed using R 170or newer

bull histdefault() now returns the nominalbreak points not those adjusted for numericaltolerances

To guard against unthinking use lsquoin-cludelowestrsquo in histdefault() is now ig-nored with a warning unless lsquobreaksrsquo is a vec-tor (It either generated an error or had no ef-fect depending how prettification of the rangeoperated)

bull New generic functions influence()hatvalues() and dfbeta() with lm and glmmethods the previously normal functionsrstudent() rstandard() cooksdistance()and dfbetas() became generic These havechanged behavior for glm objects ndash all originat-ing from John Foxrsquo car package

R News ISSN 1609-3631

Vol 31 June 2003 33

bull interactionplot() has several new argu-ments and the legend is not clipped anymoreby default It internally uses axis(1) insteadof mtext() This also addresses bugs PR820PR1305 PR1899

bull New isoreg() function and class for isotonicregression (lsquomodregrsquo package)

bull Lachol() and Lachol2inv() now give inter-pretable error messages rather than LAPACKerror codes

bull legend() has a new lsquoplotrsquo argument Settingit lsquoFALSErsquo gives size information without plot-ting (suggested by ULigges)

bull library() was changed so that when themethods package is attached it no longer com-plains about formal generic functions not spe-cific to the library

bull listfiles()dir() have a new argument lsquore-cursiversquo

bull lminfluence() has a new lsquodocoefrsquo argumentallowing not to compute casewise changedcoefficients This makes plotlm() muchquicker for large data sets

bull load() now returns invisibly a character vec-tor of the names of the objects which were re-stored

bull New convenience function loadURL() to allowloading data files from URLs (requested byFrank Harrell)

bull New function mapply() a multivariatelapply()

bull New function md5sum() in package tools to cal-culate MD5 checksums on files (eg on parts ofthe R installation)

bull medpolish() package eda now has an lsquonarmrsquoargument (PR2298)

bull methods() now looks for registered methodsin namespaces and knows about many objectsthat look like methods but are not

bull mosaicplot() has a new default for lsquomainrsquoand supports the lsquolasrsquo argument (contributedby Uwe Ligges and Wolfram Fischer)

bull An attempt to open() an already open connec-tion will be detected and ignored with a warn-ing This avoids improperly closing some typesof connections if they are opened repeatedly

bull optim(method = SANN) can now cover com-binatorial optimization by supplying a movefunction as the lsquogrrsquo argument (contributed byAdrian Trapletti)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

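For example, comparing the ordinary and the strict two-sided power for the same design:

  power.t.test(n = 20, delta = 1, sd = 2)
  power.t.test(n = 20, delta = 1, sd = 2, strict = TRUE)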
• prcomp() is now fast for n × m inputs with m >> n.

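e.g., principal components of a 'wide' matrix with far more variables than cases:

  x <- matrix(rnorm(20 * 1000), nrow = 20)   # 20 cases, 1000 variables
  pc <- prcomp(x)
  dim(pc$rotation)                           # loadings for at most min(n, m) components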
• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

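A sketch of a least-squares fit via the decomposition (assuming the option is selected through a 'LAPACK' argument, as in later releases):

  A <- matrix(rnorm(200 * 50), 200, 50)
  b <- rnorm(200)
  qA <- qr(A, LAPACK = TRUE)
  qr.coef(qA, b)    # least-squares coefficients from the decomposition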
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

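For example, two random 2 x 2 tables with row sums (10, 20) and column sums (15, 15):

  r2dtable(2, c(10, 20), c(15, 15))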
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

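For example:

  rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))   # three draws, one per column
  dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))     # probability of this particular split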
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

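For example:

  res <- try(stop("resource not available"), silent = TRUE)   # no message printed
  inherits(res, "try-error")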
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage; you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.


Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

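A one-line illustration of the generalization (assuming abind's 'along' argument, which selects the dimension to bind on):

  library(abind)
  abind(matrix(1:4, 2, 2), matrix(5:8, 2, 2), along = 3)   # a 2 x 2 x 2 array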
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g. idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org


Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk

[The crossword grid cannot be reproduced here; only its clue numbering survived the text extraction.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

Vol 31 June 2003 33

bull interactionplot() has several new argu-ments and the legend is not clipped anymoreby default It internally uses axis(1) insteadof mtext() This also addresses bugs PR820PR1305 PR1899

bull New isoreg() function and class for isotonicregression (lsquomodregrsquo package)

bull Lachol() and Lachol2inv() now give inter-pretable error messages rather than LAPACKerror codes

bull legend() has a new lsquoplotrsquo argument Settingit lsquoFALSErsquo gives size information without plot-ting (suggested by ULigges)

bull library() was changed so that when themethods package is attached it no longer com-plains about formal generic functions not spe-cific to the library

bull listfiles()dir() have a new argument lsquore-cursiversquo

bull lminfluence() has a new lsquodocoefrsquo argumentallowing not to compute casewise changedcoefficients This makes plotlm() muchquicker for large data sets

bull load() now returns invisibly a character vec-tor of the names of the objects which were re-stored

bull New convenience function loadURL() to allowloading data files from URLs (requested byFrank Harrell)

bull New function mapply() a multivariatelapply()

bull New function md5sum() in package tools to cal-culate MD5 checksums on files (eg on parts ofthe R installation)

bull medpolish() package eda now has an lsquonarmrsquoargument (PR2298)

bull methods() now looks for registered methodsin namespaces and knows about many objectsthat look like methods but are not

bull mosaicplot() has a new default for lsquomainrsquoand supports the lsquolasrsquo argument (contributedby Uwe Ligges and Wolfram Fischer)

bull An attempt to open() an already open connec-tion will be detected and ignored with a warn-ing This avoids improperly closing some typesof connections if they are opened repeatedly

bull optim(method = SANN) can now cover com-binatorial optimization by supplying a movefunction as the lsquogrrsquo argument (contributed byAdrian Trapletti)

bull PDF files produced by pdf() have more exten-sive information fields including the version ofR that produced them

bull On Unix(-alike) systems the default PDFviewer is now determined during configura-tion and available as the rsquopdfviewerrsquo option

bull pie() has always accepted graphical parsbut only passed them on to title() Nowpie( cex=15) works

bull plotdendrogram (lsquomvarsquo package) now drawsleaf labels if present by default

bull New plotdesign() function as in S

bull The postscript() and PDF() drivers now al-low the title to be set

bull New function poweranovatest() con-tributed by Claus Ekstroslashm

bull powerttest() now behaves correctly fornegative delta in the two-tailed case

bull powerttest() and powerproptest() nowhave a lsquostrictrsquo argument that includes rejectionsin the wrong tail in the power calculation(Based in part on code suggested by UlrichHalekoh)

bull prcomp() is now fast for nm inputs with m gtgtn

bull princomp() no longer allows the use of morevariables than units use prcomp() instead

bull princompformula() now has principal argu-ment lsquoformularsquo so update() can be used

bull Printing an object with attributes now dis-patches on the class(es) of the attributes Seeprintdefault for the fine print (PR2506)

bull printmatrix() and prmatrix() are now sep-arate functions prmatrix() is the old S-compatible function and printmatrix() isa proper print method currently identical toprintdefault() prmatrix() and the oldprintmatrix() did not print attributes of amatrix but the new printmatrix() does

bull printsummarylm() and printsummaryglm()now default to symboliccor = FALSEbut symboliccor can be passed tothe print methods from the summarymethods printsummarylm() andprintsummaryglm() print correlations to2 decimal places and the symbolic printoutavoids abbreviating labels

R News ISSN 1609-3631

Vol 31 June 2003 34

bull If a prompt() method is called with rsquofilenamersquoas rsquoNArsquo a list-style representation of the doc-umentation shell generated is returned Newfunction promptData() for documenting ob-jects as data sets

bull qqnorm() and qqline() have an optional log-ical argument lsquodataxrsquo to transpose the plot (S-PLUS compatibility)

bull qr() now has the option to use LAPACK rou-tines and the results can be used by the helperroutines qrcoef() qrqy() and qrqty()The LAPACK-using versions may be muchfaster for large matrices (using an optimizedBLAS) but are less flexible

bull QR objects now have class qr andsolveqr() is now just the method for solve()for the class

bull New function r2dtable() for generating ran-dom samples of two-way tables with givenmarginals using Patefieldrsquos algorithm

bull rchisq() now has a non-centrality parameterlsquoncprsquo and therersquos a C API for rnchisq()

bull New generic function reorder() with a den-drogram method new orderdendrogram()and heatmap()

bull require() has a new argument namedcharacteronly to make it align with library

bull New functions rmultinom() and dmultinom()the first one with a C API

bull New function runmed() for fast runnning me-dians (lsquomodregrsquo package)

bull New function sliceindex() for identifyingindexes with respect to slices of an array

bull solvedefault(a) now gives the dimnamesone would expect

bull stepfun() has a new lsquorightrsquo argument forright-continuous step function construction

bull str() now shows ordered factors differentfrom unordered ones It also differentiatesNA and ascharacter(NA) also for factorlevels

bull symnum() has a new logical argumentlsquoabbrcolnamesrsquo

bull summary(ltlogicalgt) now mentions NArsquos assuggested by Goumlran Brostroumlm

bull summaryRprof() now prints times with a pre-cision appropriate to the sampling intervalrather than always to 2dp

bull New function Sysgetpid() to get the processID of the R session

bull table() now allows exclude= with factor argu-ments (requested by Michael Friendly)

bull The tempfile() function now takes an op-tional second argument giving the directoryname

bull The ordering of terms for

termsformula(keeporder=FALSE)

is now defined on the help page and used con-sistently so that repeated calls will not alter theordering (which is why deleteresponse()was failing see the bug fixes) The formula isnot simplified unless the new argument lsquosim-plifyrsquo is true

bull added [ method for terms objects

bull New argument lsquosilentrsquo to try()

bull ts() now allows arbitrary values for y instartend = c(x y) it always allowed y lt1 but objected to y gt frequency

bull uniquedefault() now works for POSIXct ob-jects and hence so does factor()

bull Package tcltk now allows return values fromthe R side to the Tcl side in callbacks and theR_eval command If the return value from theR function or expression is of class tclObj thenit will be returned to Tcl

bull A new HIGHLY EXPERIMENTAL graphicaluser interface using the tcltk package is pro-vided Currently little more than a proof ofconcept It can be started by calling R -g Tk(this may change in later versions) or by eval-uating tkStartGUI() Only Unix-like systemsfor now It is not too stable at this point in par-ticular signal handling is not working prop-erly

bull Changes to support name spaces

ndash Placing base in a name spacecan no longer be disabled bydefining the environment variableR_NO_BASE_NAMESPACE

ndash New function topenv() to determine thenearest top level environment (usuallyGlobalEnv or a name space environ-ment)

ndash Added name space support for packagesthat do not use methods

R News ISSN 1609-3631

Vol 31 June 2003 35

bull Formal classes and methods can be lsquosealedrsquoby using the corresponding argument tosetClass or setMethod New functionsisSealedClass() and isSealedMethod() testsealing

bull packages can now be loaded with version num-bers This allows for multiple versions of filesto be installed (and potentially loaded) Someserious testing will be going on but it shouldhave no effect unless specifically asked for

Installation changes

bull TITLE files in packages are no longer used theTitle field in the DESCRIPTION file being pre-ferred TITLE files will be ignored in both in-stalled packages and source packages

bull When searching for a Fortran 77 compiler con-figure by default now also looks for Fujitsursquos frtand Compaqrsquos fort but no longer for cf77 andcft77

bull Configure checks that mixed CFortran codecan be run before checking compatibility onints and doubles the latter test was sometimesfailing because the Fortran libraries were notfound

bull PCRE and bzip2 are built from versions in the Rsources if the appropriate library is not found

bull New configure option lsquo--with-lapackrsquo to al-low high-performance LAPACK libraries to beused a generic LAPACK library will be used iffound This option is not the default

bull New configure options lsquo--with-libpngrsquolsquo--with-jpeglibrsquo lsquo--with-zlibrsquo lsquo--with-bzlibrsquoand lsquo--with-pcrersquo principally to allow theselibraries to be avoided if they are unsuitable

bull If the precious variable R_BROWSER is set atconfigure time it overrides the automatic selec-tion of the default browser It should be set tothe full path unless the browser appears at dif-ferent locations on different client machines

bull Perl requirements are down again to 5004 ornewer

bull Autoconf 257 or later is required to build theconfigure script

bull Configure provides a more comprehensivesummary of its results

bull Index generation now happens when installingsource packages using R code in package toolsAn existing rsquoINDEXrsquo file is used as is other-wise it is automatically generated from the

name and title entries in the Rd files Datademo and vignette indices are computed fromall available files of the respective kind andthe corresponding index information (in theRd files the rsquodemo00Indexrsquo file and theVignetteIndexEntry entries respectively)These index files as well as the package Rdcontents data base are serialized as R ob-jects in the rsquoMetarsquo subdirectory of the top-level package directory allowing for faster andmore reliable index-based computations (egin helpsearch())

bull The Rd contents data base is now computedwhen installing source packages using R codein package tools The information is repre-sented as a data frame without collapsing thealiases and keywords and serialized as an Robject (The rsquoCONTENTSrsquo file in Debian Con-trol Format is still written as it is used by theHTML search engine)

bull A NAMESPACE file in root directory of asource package is copied to the root of the pack-age installation directory Attempting to in-stall a package with a NAMESPACE file usinglsquo--saversquo signals an error this is a temporarymeasure

Deprecated amp defunct

bull The assignment operator lsquo_rsquo will be removed inthe next release and users are now warned onevery usage you may even see multiple warn-ings for each usage

If environment variable R_NO_UNDERLINEis set to anything of positive length then use oflsquo_rsquo becomes a syntax error

bull machine() Machine() and Platform() are de-funct

bull restart() is defunct Use try() as has longbeen recommended

bull The deprecated arguments lsquopkgrsquo and lsquolibrsquo ofsystemfile() have been removed

bull printNoClass() methods is deprecated (andmoved to base since it was a copy of a basefunction)

bull Primitives dataClass() and objWithClass()have been replaced by class() and classlt-()they were internal support functions for use bypackage methods

bull The use of SIGUSR2 to quit a running R processunder Unix is deprecated the signal may needto be reclaimed for other purposes

R News ISSN 1609-3631

Vol 31 June 2003 36

Utilities

bull R CMD check more compactly displays thetests of DESCRIPTION meta-information Itnow reports demos and vignettes withoutavailable index information Unless installa-tion tests are skipped checking is aborted ifthe package dependencies cannot be resolvedat run time Rd files are now also explicitlychecked for empty name and title entriesThe examples are always run with T and F re-defined to give an error if used instead of TRUEand FALSE

bull The Perl code to build help now removes anexisting example file if there are no examplesin the current help file

bull R CMD Rdindex is now deprecated in favor offunction Rdindex() in package tools

bull Sweave() now encloses the Sinput and Soutputenvironments of each chunk in an Schunk envi-ronment This allows to fix some vertical spac-ing problems when using the latex class slides

C-level facilities

bull A full double-precision LAPACK shared li-brary is made available as -lRlapack To usethis include $(LAPACK_LIBS) $(BLAS_LIBS) inPKG_LIBS

bull Header file R_extLapackh added Cdeclarations of BLAS routines moved toR_extBLASh and included in R_extApplichand R_extLinpackh for backward compatibil-ity

bull R will automatically call initialization andunload routines if present in sharedlibrariesDLLs during dynload() anddynunload() calls The routines are namedR_init_ltdll namegt and R_unload_ltdll namegtrespectively See the Writing R ExtensionsManual for more information

bull Routines exported directly from the R exe-cutable for use with C() Call() Fortran()and External() are now accessed via theregistration mechanism (optionally) used bypackages The ROUTINES file (in srcappl)and associated scripts to generate FFTabh andFFDeclh are no longer used

bull Entry point Rf_append is no longer in the in-stalled headers (but is still available) It is ap-parently unused

bull Many conflicts between other headersand Rrsquos can be avoided by definingSTRICT_R_HEADERS andor R_NO_REMAPndash see lsquoWriting R Extensionsrsquo for details

bull New entry point R_GetX11Image and formerlyundocumented ptr_R_GetX11Image are in newheader R_extGetX11Image These are used bypackage tkrplot

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger
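
A small sketch of the kind of call the package provides (invented data; abind() generalises cbind()/rbind() to higher dimensions):

  library(abind)
  m1 <- matrix(1:4, nrow = 2)
  m2 <- matrix(5:8, nrow = 2)
  abind(m1, m2, along = 3)   # binds the two 2 x 2 matrices into a 2 x 2 x 2 array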

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina


session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]


Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid (entries numbered 1-27) cannot be reproduced in this text version; see the URL above for a printable copy.]

Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models, the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
[email protected]

Changes in R 1.7.0
by the R Core Team

User-visible changes

• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
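
A minimal sketch (not from the release notes) of what this means for S3 dispatch:

  foo <- function(x) UseMethod("foo")
  foo.matrix  <- function(x) "matrix method"
  foo.default <- function(x) "default method"
  foo(matrix(1:4, nrow = 2))   # "matrix method", since class() now reports "matrix"
  foo(1:4)                     # "default method"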

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
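
For example, to reproduce simulations run under an earlier version, one might do something along these lines (a sketch):

  RNGversion("1.6.2")   # generators that were the default before 1.7.0
  set.seed(42)
  runif(3)
  RNGversion("1.7.0")   # back to Mersenne-Twister / Inversion
  set.seed(42)
  runif(3)              # different values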

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.

New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.


• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
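
For example:

  x <- list(a = 1, b = list(c = 2, d = list(e = 3)))
  x[[c(2, 2, 1)]]   # shorthand for x[[2]][[2]][[1]], i.e. 3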

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
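
A small sketch of its use:

  data(cars)
  fit <- lm(dist ~ speed, data = cars)
  out <- capture.output(summary(fit))   # the printed summary as a character vector
  out[1:4]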

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
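
A minimal sketch (objective and constraint invented for illustration); the feasible region is supplied in the form ui %*% theta - ci >= 0:

  f  <- function(x) sum((x - 2)^2)   # minimise a quadratic ...
  gr <- function(x) 2 * (x - 2)      # ... with its gradient
  ## subject to x + y <= 1, written as -x - y >= -1
  constrOptim(theta = c(0, 0), f = f, grad = gr,
              ui = matrix(c(-1, -1), nrow = 1), ci = -1)$par   # roughly (0.5, 0.5)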

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
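
For example, one might use these to look at code hidden in a name space (a sketch):

  getS3method("print", "data.frame")              # a registered S3 method
  getFromNamespace("print.data.frame", "base")    # an object from a name space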

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so can read compressed saves from suitable connections.
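
A small sketch (the file name is made up):

  x <- 1:10
  save(x, file = "x.RData", compress = TRUE)
  con <- gzcon(file("x.RData", "rb"))   # wrap a raw connection for decompression
  load(con)
  close(con)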

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.


• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
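
For example:

  mapply(rep, 1:3, 3:1)                              # list(rep(1,3), rep(2,2), rep(3,1))
  mapply(function(a, b) a + b, 1:3, c(10, 20, 30))   # 11 22 33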

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.


• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.
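
For example, two random 2 x 2 tables with row totals (10, 20) and column totals (15, 15):

  r2dtable(2, r = c(10, 20), c = c(15, 15))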

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
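
For example:

  rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))   # three draws, one per column
  dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))     # probability of one outcome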

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().
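
For example:

  res <- try(log("a"), silent = TRUE)   # the error is caught and nothing is printed
  inherits(res, "try-error")            # TRUE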

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.


• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.

R News ISSN 1609-3631

Vol 31 June 2003 36

Utilities

bull R CMD check more compactly displays thetests of DESCRIPTION meta-information Itnow reports demos and vignettes withoutavailable index information Unless installa-tion tests are skipped checking is aborted ifthe package dependencies cannot be resolvedat run time Rd files are now also explicitlychecked for empty name and title entriesThe examples are always run with T and F re-defined to give an error if used instead of TRUEand FALSE

bull The Perl code to build help now removes anexisting example file if there are no examplesin the current help file

bull R CMD Rdindex is now deprecated in favor offunction Rdindex() in package tools

bull Sweave() now encloses the Sinput and Soutputenvironments of each chunk in an Schunk envi-ronment This allows to fix some vertical spac-ing problems when using the latex class slides

C-level facilities

bull A full double-precision LAPACK shared li-brary is made available as -lRlapack To usethis include $(LAPACK_LIBS) $(BLAS_LIBS) inPKG_LIBS

bull Header file R_extLapackh added Cdeclarations of BLAS routines moved toR_extBLASh and included in R_extApplichand R_extLinpackh for backward compatibil-ity

bull R will automatically call initialization andunload routines if present in sharedlibrariesDLLs during dynload() anddynunload() calls The routines are namedR_init_ltdll namegt and R_unload_ltdll namegtrespectively See the Writing R ExtensionsManual for more information

bull Routines exported directly from the R exe-cutable for use with C() Call() Fortran()and External() are now accessed via theregistration mechanism (optionally) used bypackages The ROUTINES file (in srcappl)and associated scripts to generate FFTabh andFFDeclh are no longer used

bull Entry point Rf_append is no longer in the in-stalled headers (but is still available) It is ap-parently unused

bull Many conflicts between other headersand Rrsquos can be avoided by definingSTRICT_R_HEADERS andor R_NO_REMAPndash see lsquoWriting R Extensionsrsquo for details

bull New entry point R_GetX11Image and formerlyundocumented ptr_R_GetX11Image are in newheader R_extGetX11Image These are used bypackage tkrplot

Changes on CRANby Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin

GRASS Interface between GRASS 50 geographicalinformation system and R based on starting Rfrom within the GRASS environment using val-ues of environment variables set in the GISRCfile Interface examples should be run outsideGRASS others may be run within Wrapperand helper functions are provided for a rangeof R functions to match the interface metadatastructures Interface functions by Roger Bi-vand wrapper and helper functions modifiedfrom various originals by interface author

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 03 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

RSvgDevice A graphics device for R that uses thenew w3org xml standard for Scalable VectorGraphics By T Jake Luciani

SenSrivastava Collection of datasets from Sen amp Sri-vastava Regression Analysis Theory Methodsand Applications Springer Sources for indi-vidual data files are more fully documented in

R News ISSN 1609-3631

Vol 31 June 2003 37

the book By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas

normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina

R News ISSN 1609-3631

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 32: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 32

bull On platforms where vsnprintf does not returnthe needed buffer size the output line-lengthlimit for fifo() gzfile() and bzfile() hasbeen raised from 10k to 100k chars

bull The Math group generic does not check thenumber of arguments supplied before dis-patch it used to if the default method had oneargument but not if it had two This allowstruncPOSIXt() to be called via the groupgeneric trunc()

bull Logical matrix replacement indexing of dataframes is now implemented (interpreted as ifthe lhs was a matrix)

bull Recursive indexing of lists is allowed sox[[c(42)]] is shorthand for x[[4]][[2]] etc(Wishlist PR1588)

bull Most of the time series functions now check ex-plicitly for a numeric time series rather thanfail at a later stage

bull The postscript output makes use of relativemoves and so is somewhat more compact

bull and crossprod() for complex argumentsmake use of BLAS routines and so may bemuch faster on some platforms

bull arima() has coef() logLik() (and hence AIC)and vcov() methods

bull New function asdifftime() for time-intervaldata

bull basename() and dirname() are now vector-ized

bull biplotdefault() mva allows lsquoxlabrsquo andlsquoylabrsquo parameters to be set (without partiallymatching to lsquoxlabsrsquo and lsquoylabsrsquo) (Thanks toUwe Ligges)

bull New function captureoutput() to sendprinted output from an expression to a connec-tion or a text string

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
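
A minimal sketch with a made-up objective: minimise a quadratic subject to x >= 0, with the constraints expressed in the form ui %*% theta - ci >= 0:

    f <- function(x) sum((x - c(2, -1))^2)
    constrOptim(theta = c(1, 1), f = f, grad = NULL,
                ui = diag(2), ci = c(0, 0))$par    # approximately c(2, 0)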

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.
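
A minimal sketch of reading a compressed save through a connection (the URL is hypothetical):

    con <- gzcon(url("http://example.org/results.RData"))  # hypothetical location
    objs <- load(con)    # returns (invisibly) the names of the restored objects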

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
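
A brief illustration with a built-in data set (a minimal sketch):

    fit <- lm(dist ~ speed, data = cars)
    head(hatvalues(fit))        # leverages
    head(rstudent(fit))         # studentized residuals, now via a generic
    head(cooks.distance(fit))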

• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).
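
For example, to measure a legend before drawing it (a minimal sketch):

    plot(1:10)
    info <- legend(1, 10, legend = c("first", "second"), plot = FALSE)
    info$rect$w    # width of the legend box in user coordinates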

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
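
For example (a minimal sketch):

    mapply(rep, 1:3, 3:1)                    # list(rep(1, 3), rep(2, 2), rep(3, 1))
    mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9 (simplified to a vector)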

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.

• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.
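
A minimal sketch comparing the two routes on a small random matrix:

    set.seed(1)
    A <- matrix(rnorm(400), 20, 20)
    b <- rnorm(20)
    qr.coef(qr(A), b)                   # LINPACK-based (the default)
    qr.coef(qr(A, LAPACK = TRUE), b)    # LAPACK-based, may be faster for large matrices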

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.
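
For example, three random 3 × 2 tables with fixed margins (a minimal sketch):

    r2dtable(3, c(10, 20, 30), c(25, 35))    # a list of three tables with these row/column sums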

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
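
For example (a minimal sketch):

    p <- c(0.2, 0.3, 0.5)
    rmultinom(2, size = 10, prob = p)            # two draws, one per column
    dmultinom(c(2, 3, 5), size = 10, prob = p)   # probability of one particular outcome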

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
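
For example (a minimal sketch; the formula is purely symbolic, no data are needed):

    attr(terms(y ~ a:b + a + b), "term.labels")                       # "a" "b" "a:b"
    attr(terms(y ~ a:b + a + b, keep.order = TRUE), "term.labels")    # "a:b" "a" "b"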

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.

• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
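
The two assignment forms, for reference (a minimal sketch; the underscore form still parses in this version of R but now triggers a warning):

    x _ 1:10     # old-style underscore assignment: deprecated
    x <- 1:10    # preferred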

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.

Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities for normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.

session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]

Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[Crossword grid not reproduced here; the numbered cells correspond to clues 1–27 below.]

Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac

Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 33: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 33

bull interactionplot() has several new argu-ments and the legend is not clipped anymoreby default It internally uses axis(1) insteadof mtext() This also addresses bugs PR820PR1305 PR1899

bull New isoreg() function and class for isotonicregression (lsquomodregrsquo package)

bull Lachol() and Lachol2inv() now give inter-pretable error messages rather than LAPACKerror codes

bull legend() has a new lsquoplotrsquo argument Settingit lsquoFALSErsquo gives size information without plot-ting (suggested by ULigges)

bull library() was changed so that when themethods package is attached it no longer com-plains about formal generic functions not spe-cific to the library

bull listfiles()dir() have a new argument lsquore-cursiversquo

bull lminfluence() has a new lsquodocoefrsquo argumentallowing not to compute casewise changedcoefficients This makes plotlm() muchquicker for large data sets

bull load() now returns invisibly a character vec-tor of the names of the objects which were re-stored

bull New convenience function loadURL() to allowloading data files from URLs (requested byFrank Harrell)

bull New function mapply() a multivariatelapply()

bull New function md5sum() in package tools to cal-culate MD5 checksums on files (eg on parts ofthe R installation)

bull medpolish() package eda now has an lsquonarmrsquoargument (PR2298)

bull methods() now looks for registered methodsin namespaces and knows about many objectsthat look like methods but are not

bull mosaicplot() has a new default for lsquomainrsquoand supports the lsquolasrsquo argument (contributedby Uwe Ligges and Wolfram Fischer)

bull An attempt to open() an already open connec-tion will be detected and ignored with a warn-ing This avoids improperly closing some typesof connections if they are opened repeatedly

bull optim(method = SANN) can now cover com-binatorial optimization by supplying a movefunction as the lsquogrrsquo argument (contributed byAdrian Trapletti)

bull PDF files produced by pdf() have more exten-sive information fields including the version ofR that produced them

bull On Unix(-alike) systems the default PDFviewer is now determined during configura-tion and available as the rsquopdfviewerrsquo option

bull pie() has always accepted graphical parsbut only passed them on to title() Nowpie( cex=15) works

bull plotdendrogram (lsquomvarsquo package) now drawsleaf labels if present by default

bull New plotdesign() function as in S

bull The postscript() and PDF() drivers now al-low the title to be set

bull New function poweranovatest() con-tributed by Claus Ekstroslashm

bull powerttest() now behaves correctly fornegative delta in the two-tailed case

bull powerttest() and powerproptest() nowhave a lsquostrictrsquo argument that includes rejectionsin the wrong tail in the power calculation(Based in part on code suggested by UlrichHalekoh)

bull prcomp() is now fast for nm inputs with m gtgtn

bull princomp() no longer allows the use of morevariables than units use prcomp() instead

bull princompformula() now has principal argu-ment lsquoformularsquo so update() can be used

bull Printing an object with attributes now dis-patches on the class(es) of the attributes Seeprintdefault for the fine print (PR2506)

bull printmatrix() and prmatrix() are now sep-arate functions prmatrix() is the old S-compatible function and printmatrix() isa proper print method currently identical toprintdefault() prmatrix() and the oldprintmatrix() did not print attributes of amatrix but the new printmatrix() does

bull printsummarylm() and printsummaryglm()now default to symboliccor = FALSEbut symboliccor can be passed tothe print methods from the summarymethods printsummarylm() andprintsummaryglm() print correlations to2 decimal places and the symbolic printoutavoids abbreviating labels

R News ISSN 1609-3631

Vol 31 June 2003 34

bull If a prompt() method is called with rsquofilenamersquoas rsquoNArsquo a list-style representation of the doc-umentation shell generated is returned Newfunction promptData() for documenting ob-jects as data sets

bull qqnorm() and qqline() have an optional log-ical argument lsquodataxrsquo to transpose the plot (S-PLUS compatibility)

bull qr() now has the option to use LAPACK rou-tines and the results can be used by the helperroutines qrcoef() qrqy() and qrqty()The LAPACK-using versions may be muchfaster for large matrices (using an optimizedBLAS) but are less flexible

bull QR objects now have class qr andsolveqr() is now just the method for solve()for the class

bull New function r2dtable() for generating ran-dom samples of two-way tables with givenmarginals using Patefieldrsquos algorithm

bull rchisq() now has a non-centrality parameterlsquoncprsquo and therersquos a C API for rnchisq()

bull New generic function reorder() with a den-drogram method new orderdendrogram()and heatmap()

bull require() has a new argument namedcharacteronly to make it align with library

bull New functions rmultinom() and dmultinom()the first one with a C API

bull New function runmed() for fast runnning me-dians (lsquomodregrsquo package)

bull New function sliceindex() for identifyingindexes with respect to slices of an array

bull solvedefault(a) now gives the dimnamesone would expect

bull stepfun() has a new lsquorightrsquo argument forright-continuous step function construction

bull str() now shows ordered factors differentfrom unordered ones It also differentiatesNA and ascharacter(NA) also for factorlevels

bull symnum() has a new logical argumentlsquoabbrcolnamesrsquo

bull summary(ltlogicalgt) now mentions NArsquos assuggested by Goumlran Brostroumlm

bull summaryRprof() now prints times with a pre-cision appropriate to the sampling intervalrather than always to 2dp

bull New function Sysgetpid() to get the processID of the R session

bull table() now allows exclude= with factor argu-ments (requested by Michael Friendly)

bull The tempfile() function now takes an op-tional second argument giving the directoryname

bull The ordering of terms for

termsformula(keeporder=FALSE)

is now defined on the help page and used con-sistently so that repeated calls will not alter theordering (which is why deleteresponse()was failing see the bug fixes) The formula isnot simplified unless the new argument lsquosim-plifyrsquo is true

bull added [ method for terms objects

bull New argument lsquosilentrsquo to try()

bull ts() now allows arbitrary values for y instartend = c(x y) it always allowed y lt1 but objected to y gt frequency

bull uniquedefault() now works for POSIXct ob-jects and hence so does factor()

bull Package tcltk now allows return values fromthe R side to the Tcl side in callbacks and theR_eval command If the return value from theR function or expression is of class tclObj thenit will be returned to Tcl

bull A new HIGHLY EXPERIMENTAL graphicaluser interface using the tcltk package is pro-vided Currently little more than a proof ofconcept It can be started by calling R -g Tk(this may change in later versions) or by eval-uating tkStartGUI() Only Unix-like systemsfor now It is not too stable at this point in par-ticular signal handling is not working prop-erly

bull Changes to support name spaces

ndash Placing base in a name spacecan no longer be disabled bydefining the environment variableR_NO_BASE_NAMESPACE

ndash New function topenv() to determine thenearest top level environment (usuallyGlobalEnv or a name space environ-ment)

ndash Added name space support for packagesthat do not use methods

R News ISSN 1609-3631

Vol 31 June 2003 35

bull Formal classes and methods can be lsquosealedrsquoby using the corresponding argument tosetClass or setMethod New functionsisSealedClass() and isSealedMethod() testsealing

bull packages can now be loaded with version num-bers This allows for multiple versions of filesto be installed (and potentially loaded) Someserious testing will be going on but it shouldhave no effect unless specifically asked for

Installation changes

bull TITLE files in packages are no longer used theTitle field in the DESCRIPTION file being pre-ferred TITLE files will be ignored in both in-stalled packages and source packages

bull When searching for a Fortran 77 compiler con-figure by default now also looks for Fujitsursquos frtand Compaqrsquos fort but no longer for cf77 andcft77

bull Configure checks that mixed CFortran codecan be run before checking compatibility onints and doubles the latter test was sometimesfailing because the Fortran libraries were notfound

bull PCRE and bzip2 are built from versions in the Rsources if the appropriate library is not found

bull New configure option lsquo--with-lapackrsquo to al-low high-performance LAPACK libraries to beused a generic LAPACK library will be used iffound This option is not the default

bull New configure options lsquo--with-libpngrsquolsquo--with-jpeglibrsquo lsquo--with-zlibrsquo lsquo--with-bzlibrsquoand lsquo--with-pcrersquo principally to allow theselibraries to be avoided if they are unsuitable

bull If the precious variable R_BROWSER is set atconfigure time it overrides the automatic selec-tion of the default browser It should be set tothe full path unless the browser appears at dif-ferent locations on different client machines

bull Perl requirements are down again to 5004 ornewer

bull Autoconf 257 or later is required to build theconfigure script

bull Configure provides a more comprehensivesummary of its results

bull Index generation now happens when installingsource packages using R code in package toolsAn existing rsquoINDEXrsquo file is used as is other-wise it is automatically generated from the

name and title entries in the Rd files Datademo and vignette indices are computed fromall available files of the respective kind andthe corresponding index information (in theRd files the rsquodemo00Indexrsquo file and theVignetteIndexEntry entries respectively)These index files as well as the package Rdcontents data base are serialized as R ob-jects in the rsquoMetarsquo subdirectory of the top-level package directory allowing for faster andmore reliable index-based computations (egin helpsearch())

bull The Rd contents data base is now computedwhen installing source packages using R codein package tools The information is repre-sented as a data frame without collapsing thealiases and keywords and serialized as an Robject (The rsquoCONTENTSrsquo file in Debian Con-trol Format is still written as it is used by theHTML search engine)

bull A NAMESPACE file in root directory of asource package is copied to the root of the pack-age installation directory Attempting to in-stall a package with a NAMESPACE file usinglsquo--saversquo signals an error this is a temporarymeasure

Deprecated amp defunct

bull The assignment operator lsquo_rsquo will be removed inthe next release and users are now warned onevery usage you may even see multiple warn-ings for each usage

If environment variable R_NO_UNDERLINEis set to anything of positive length then use oflsquo_rsquo becomes a syntax error

bull machine() Machine() and Platform() are de-funct

bull restart() is defunct Use try() as has longbeen recommended

bull The deprecated arguments lsquopkgrsquo and lsquolibrsquo ofsystemfile() have been removed

bull printNoClass() methods is deprecated (andmoved to base since it was a copy of a basefunction)

bull Primitives dataClass() and objWithClass()have been replaced by class() and classlt-()they were internal support functions for use bypackage methods

bull The use of SIGUSR2 to quit a running R processunder Unix is deprecated the signal may needto be reclaimed for other purposes

R News ISSN 1609-3631

Vol 31 June 2003 36

Utilities

bull R CMD check more compactly displays thetests of DESCRIPTION meta-information Itnow reports demos and vignettes withoutavailable index information Unless installa-tion tests are skipped checking is aborted ifthe package dependencies cannot be resolvedat run time Rd files are now also explicitlychecked for empty name and title entriesThe examples are always run with T and F re-defined to give an error if used instead of TRUEand FALSE

bull The Perl code to build help now removes anexisting example file if there are no examplesin the current help file

bull R CMD Rdindex is now deprecated in favor offunction Rdindex() in package tools

bull Sweave() now encloses the Sinput and Soutputenvironments of each chunk in an Schunk envi-ronment This allows to fix some vertical spac-ing problems when using the latex class slides

C-level facilities

bull A full double-precision LAPACK shared li-brary is made available as -lRlapack To usethis include $(LAPACK_LIBS) $(BLAS_LIBS) inPKG_LIBS

bull Header file R_extLapackh added Cdeclarations of BLAS routines moved toR_extBLASh and included in R_extApplichand R_extLinpackh for backward compatibil-ity

bull R will automatically call initialization andunload routines if present in sharedlibrariesDLLs during dynload() anddynunload() calls The routines are namedR_init_ltdll namegt and R_unload_ltdll namegtrespectively See the Writing R ExtensionsManual for more information

bull Routines exported directly from the R exe-cutable for use with C() Call() Fortran()and External() are now accessed via theregistration mechanism (optionally) used bypackages The ROUTINES file (in srcappl)and associated scripts to generate FFTabh andFFDeclh are no longer used

bull Entry point Rf_append is no longer in the in-stalled headers (but is still available) It is ap-parently unused

bull Many conflicts between other headersand Rrsquos can be avoided by definingSTRICT_R_HEADERS andor R_NO_REMAPndash see lsquoWriting R Extensionsrsquo for details

bull New entry point R_GetX11Image and formerlyundocumented ptr_R_GetX11Image are in newheader R_extGetX11Image These are used bypackage tkrplot

Changes on CRANby Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin

GRASS Interface between GRASS 50 geographicalinformation system and R based on starting Rfrom within the GRASS environment using val-ues of environment variables set in the GISRCfile Interface examples should be run outsideGRASS others may be run within Wrapperand helper functions are provided for a rangeof R functions to match the interface metadatastructures Interface functions by Roger Bi-vand wrapper and helper functions modifiedfrom various originals by interface author

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 03 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

RSvgDevice A graphics device for R that uses thenew w3org xml standard for Scalable VectorGraphics By T Jake Luciani

SenSrivastava Collection of datasets from Sen amp Sri-vastava Regression Analysis Theory Methodsand Applications Springer Sources for indi-vidual data files are more fully documented in

R News ISSN 1609-3631

Vol 31 June 2003 37

the book By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas

normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina

R News ISSN 1609-3631

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac


Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing, communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/




Changes on CRANby Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin

GRASS Interface between GRASS 50 geographicalinformation system and R based on starting Rfrom within the GRASS environment using val-ues of environment variables set in the GISRCfile Interface examples should be run outsideGRASS others may be run within Wrapperand helper functions are provided for a rangeof R functions to match the interface metadatastructures Interface functions by Roger Bi-vand wrapper and helper functions modifiedfrom various originals by interface author

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 03 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

RSvgDevice A graphics device for R that uses thenew w3org xml standard for Scalable VectorGraphics By T Jake Luciani

SenSrivastava Collection of datasets from Sen amp Sri-vastava Regression Analysis Theory Methodsand Applications Springer Sources for indi-vidual data files are more fully documented in

R News ISSN 1609-3631

Vol 31 June 2003 37

the book By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas

normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina

R News ISSN 1609-3631

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 35: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 35

bull Formal classes and methods can be lsquosealedrsquoby using the corresponding argument tosetClass or setMethod New functionsisSealedClass() and isSealedMethod() testsealing

bull packages can now be loaded with version num-bers This allows for multiple versions of filesto be installed (and potentially loaded) Someserious testing will be going on but it shouldhave no effect unless specifically asked for

Installation changes

bull TITLE files in packages are no longer used theTitle field in the DESCRIPTION file being pre-ferred TITLE files will be ignored in both in-stalled packages and source packages

bull When searching for a Fortran 77 compiler con-figure by default now also looks for Fujitsursquos frtand Compaqrsquos fort but no longer for cf77 andcft77

bull Configure checks that mixed CFortran codecan be run before checking compatibility onints and doubles the latter test was sometimesfailing because the Fortran libraries were notfound

bull PCRE and bzip2 are built from versions in the Rsources if the appropriate library is not found

bull New configure option lsquo--with-lapackrsquo to al-low high-performance LAPACK libraries to beused a generic LAPACK library will be used iffound This option is not the default

bull New configure options lsquo--with-libpngrsquolsquo--with-jpeglibrsquo lsquo--with-zlibrsquo lsquo--with-bzlibrsquoand lsquo--with-pcrersquo principally to allow theselibraries to be avoided if they are unsuitable

bull If the precious variable R_BROWSER is set atconfigure time it overrides the automatic selec-tion of the default browser It should be set tothe full path unless the browser appears at dif-ferent locations on different client machines

bull Perl requirements are down again to 5004 ornewer

bull Autoconf 257 or later is required to build theconfigure script

bull Configure provides a more comprehensivesummary of its results

bull Index generation now happens when installingsource packages using R code in package toolsAn existing rsquoINDEXrsquo file is used as is other-wise it is automatically generated from the

name and title entries in the Rd files Datademo and vignette indices are computed fromall available files of the respective kind andthe corresponding index information (in theRd files the rsquodemo00Indexrsquo file and theVignetteIndexEntry entries respectively)These index files as well as the package Rdcontents data base are serialized as R ob-jects in the rsquoMetarsquo subdirectory of the top-level package directory allowing for faster andmore reliable index-based computations (egin helpsearch())

bull The Rd contents data base is now computedwhen installing source packages using R codein package tools The information is repre-sented as a data frame without collapsing thealiases and keywords and serialized as an Robject (The rsquoCONTENTSrsquo file in Debian Con-trol Format is still written as it is used by theHTML search engine)

bull A NAMESPACE file in root directory of asource package is copied to the root of the pack-age installation directory Attempting to in-stall a package with a NAMESPACE file usinglsquo--saversquo signals an error this is a temporarymeasure

Deprecated amp defunct

bull The assignment operator lsquo_rsquo will be removed inthe next release and users are now warned onevery usage you may even see multiple warn-ings for each usage

If environment variable R_NO_UNDERLINEis set to anything of positive length then use oflsquo_rsquo becomes a syntax error

bull machine() Machine() and Platform() are de-funct

bull restart() is defunct Use try() as has longbeen recommended

bull The deprecated arguments lsquopkgrsquo and lsquolibrsquo ofsystemfile() have been removed

bull printNoClass() methods is deprecated (andmoved to base since it was a copy of a basefunction)

bull Primitives dataClass() and objWithClass()have been replaced by class() and classlt-()they were internal support functions for use bypackage methods

bull The use of SIGUSR2 to quit a running R processunder Unix is deprecated the signal may needto be reclaimed for other purposes

R News ISSN 1609-3631

Vol 31 June 2003 36

Utilities

bull R CMD check more compactly displays thetests of DESCRIPTION meta-information Itnow reports demos and vignettes withoutavailable index information Unless installa-tion tests are skipped checking is aborted ifthe package dependencies cannot be resolvedat run time Rd files are now also explicitlychecked for empty name and title entriesThe examples are always run with T and F re-defined to give an error if used instead of TRUEand FALSE

bull The Perl code to build help now removes anexisting example file if there are no examplesin the current help file

bull R CMD Rdindex is now deprecated in favor offunction Rdindex() in package tools

bull Sweave() now encloses the Sinput and Soutputenvironments of each chunk in an Schunk envi-ronment This allows to fix some vertical spac-ing problems when using the latex class slides

C-level facilities

bull A full double-precision LAPACK shared li-brary is made available as -lRlapack To usethis include $(LAPACK_LIBS) $(BLAS_LIBS) inPKG_LIBS

bull Header file R_extLapackh added Cdeclarations of BLAS routines moved toR_extBLASh and included in R_extApplichand R_extLinpackh for backward compatibil-ity

bull R will automatically call initialization andunload routines if present in sharedlibrariesDLLs during dynload() anddynunload() calls The routines are namedR_init_ltdll namegt and R_unload_ltdll namegtrespectively See the Writing R ExtensionsManual for more information

bull Routines exported directly from the R exe-cutable for use with C() Call() Fortran()and External() are now accessed via theregistration mechanism (optionally) used bypackages The ROUTINES file (in srcappl)and associated scripts to generate FFTabh andFFDeclh are no longer used

bull Entry point Rf_append is no longer in the in-stalled headers (but is still available) It is ap-parently unused

bull Many conflicts between other headersand Rrsquos can be avoided by definingSTRICT_R_HEADERS andor R_NO_REMAPndash see lsquoWriting R Extensionsrsquo for details

bull New entry point R_GetX11Image and formerlyundocumented ptr_R_GetX11Image are in newheader R_extGetX11Image These are used bypackage tkrplot

Changes on CRANby Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin

GRASS Interface between GRASS 50 geographicalinformation system and R based on starting Rfrom within the GRASS environment using val-ues of environment variables set in the GISRCfile Interface examples should be run outsideGRASS others may be run within Wrapperand helper functions are provided for a rangeof R functions to match the interface metadatastructures Interface functions by Roger Bi-vand wrapper and helper functions modifiedfrom various originals by interface author

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 03 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

RSvgDevice A graphics device for R that uses thenew w3org xml standard for Scalable VectorGraphics By T Jake Luciani

SenSrivastava Collection of datasets from Sen amp Sri-vastava Regression Analysis Theory Methodsand Applications Springer Sources for indi-vidual data files are more fully documented in

R News ISSN 1609-3631

Vol 31 June 2003 37

the book By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas

normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina

R News ISSN 1609-3631

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE (see the first sketch after this list).

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides (see the second sketch after this list).
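To see why the T/F redefinition matters, consider the following minimal R session (a sketch added here for illustration, not part of the release notes): unlike TRUE and FALSE, which are reserved words, T and F are ordinary variables and can be silently reassigned.

    x <- c(1, 2, NA)
    mean(x, na.rm = T)    # 1.5, but only because T currently evaluates to TRUE
    T <- 0                # perfectly legal: T is just a variable
    mean(x, na.rm = T)    # NA, since na.rm is now 0, i.e. FALSE
    ## TRUE <- 0          # in contrast, assignment to TRUE is an error

Running the examples with T and F redefined turns every such use into an error, so this class of bug is caught at check time.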
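The effect of the Schunk change is easiest to see in the LaTeX file Sweave() produces. A schematic sketch, using a hypothetical vignette source 'example.Rnw' (the file name is invented; the environment names are those written by the Sweave LaTeX driver):

    ## After processing, the generated example.tex wraps each chunk as
    ##   \begin{Schunk}
    ##   \begin{Sinput}   ... echoed R input ...   \end{Sinput}
    ##   \begin{Soutput}  ... printed output ...   \end{Soutput}
    ##   \end{Schunk}
    ## so the vertical space around chunks can be adjusted by redefining
    ## the Schunk environment in the document preamble.
    Sweave("example.Rnw")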

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present in shared libraries/DLLs, during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.

Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
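A minimal usage sketch (data invented for illustration):

    library(abind)
    m1 <- matrix(1:4, nrow = 2)
    m2 <- matrix(5:8, nrow = 2)
    a <- abind(m1, m2, along = 3)   # two 2 x 2 matrices become a 2 x 2 x 2 array
    dim(a)                          # 2 2 2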

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
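A small sketch of the genotype class with made-up data (see the genetics article earlier in this issue for a fuller example):

    library(genetics)
    g <- genotype(c("A/A", "A/C", "C/C", "A/C", "C/C"))
    summary(g)      # allele and genotype frequencies
    allele(g, 1)    # first allele of each observation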

glmmML A maximum likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including a Sobel filter, rank filters, FFT, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.

session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.
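A minimal sketch of a snow session, assuming two socket workers can be started on the local machine (hosts and computation invented for illustration):

    library(snow)
    cl <- makeSOCKcluster(c("localhost", "localhost"))
    clusterApply(cl, 1:2, function(i) i^2)   # returns list(1, 4)
    stopCluster(cl)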

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
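A sketch of the basic workflow with invented data (the survey article earlier in this issue gives worked examples):

    library(survey)
    d <- data.frame(cluster = rep(1:10, each = 5),
                    weight  = runif(50, 10, 20),
                    income  = rlnorm(50, meanlog = 10))
    des <- svydesign(id = ~cluster, weights = ~weight, data = d)
    svymean(~income, des)   # design-based mean and its standard error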

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs Provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

Friedrich Leisch
Technische Universität Wien, Austria
[email protected]

Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
[email protected]

[The crossword grid is not reproduced here; clue numbers refer to positions in the printed grid.]

Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac

Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 37: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 37

the book By Kjetil Halvorsen

abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger

ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel

amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas

anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad

climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad

dispmod Functions for modelling dispersion inGLM By Luca Scrucca

effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox

gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway

genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch

glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm

gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta

grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz

gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others

hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald

ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson

locfit Local Regression Likelihood and density esti-mation By Catherine Loader

mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard

mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley

multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas

normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo

plspcr Multivariate regression by PLS and PCR ByRon Wehrens

polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg

rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina

R News ISSN 1609-3631

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 38: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 38

session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes

snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li

statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth

survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley

vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang

RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang

RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display

icon list file list and directory tree By DuncanTemple Lang

RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang

RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang

RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang

SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang

SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg

R News ISSN 1609-3631

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

  • Editorial
  • Name Space Management for R
    • Introduction
    • Adding a name space to a package
    • Name spaces and method dispatch
    • A simple example
    • Some details
    • Discussion
      • Converting Packages to S4
        • Introduction
        • S3 versus S4 classes and methods
        • Package conversion
        • Pitfalls
        • Conclusions
          • The genetics Package
            • Introduction
            • The genetics package
            • Implementation
            • Example
            • Conclusion
              • Variance Inflation Factors
                • Centered variance inflation factors
                • Uncentered variance inflation factors
                • Conclusion
                  • Building Microsoft Windows Versions of R and R packages under Intel Linux
                    • Setting up the build area
                      • Create work area
                      • Download tools and sources
                      • Cross-tools setup
                      • Prepare source code
                      • Build Linux R if needed
                        • Building R
                          • Configuring
                          • Compiling
                            • Building R packages and bundles
                            • The Makefile
                            • Possible pitfalls
                            • Acknowledgments
                              • Analysing Survey Data in R
                                • Introduction
                                  • Weighted estimation
                                  • Standard errors
                                  • Subsetting surveys
                                    • Example
                                    • The survey package
                                      • Analyses
                                        • Extending the survey package
                                        • Summary
                                        • References
                                          • Computational Gains Using RPVM on a Beowulf Cluster
                                            • Introduction
                                            • A parallel computing environment
                                            • Gene-shaving
                                            • Parallel implementation of gene-shaving
                                              • The parallel setup
                                              • Bootstrapping
                                              • Orthogonalization
                                                • The RPVM code
                                                  • Bootstrapping
                                                  • Orthogonalization
                                                    • Gene-shaving results
                                                      • Serial gene-shaving
                                                      • Parallel gene-shaving
                                                        • Conclusions
                                                          • R Help Desk
                                                            • Introduction
                                                            • Manuals
                                                            • Help functions
                                                            • Mailing lists
                                                              • Book Reviews
                                                                • Michael Crawley Statistical Computing An Introduction to Data Analysis Using S-Plus
                                                                • Peter Dalgaard Introductory Statistics with R
                                                                • John Fox An R and S-Plus Companion to Applied Regression
                                                                  • Changes in R 170
                                                                  • Changes on CRAN
                                                                    • New contributed packages
                                                                    • New Omegahat packages
                                                                      • Crossword
                                                                      • Recent Events
                                                                        • DSC 2003
Page 39: Editorial · News The Newsletter of the R Project Volume 3/1, June 2003 Editorial by Friedrich Leisch Welcome to the first issue of R News in 2003, which is also the first issue

Vol 31 June 2003 39

Crosswordby Barry Rowlingson

Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but

most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details

Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk

1 2 3 4 5 6 7 8

9 10

11 12

13 14

15

16 17 18

19 20 21 22

23

24 25

26 27

Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)

Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac

R News ISSN 1609-3631

Vol 31 June 2003 40

Recent EventsDSC 2003

Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation

Two events took place in Vienna just prior to DSC

2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening

While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants

Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo

Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende

Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria

Editorial BoardDouglas Bates and Thomas Lumley

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631


Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
[email protected]

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions should go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631
